We consider a nonparametric additive model of a conditional
mean function in which the number of variables and additive compo- 
nents may be larger than the sample size but the number of nonzero 
additive components is "small" relative to the sample size. The statis- 
tical problem is to determine which additive components are nonzero. 
The additive components are approximated by truncated series ex- 
pansions with B-spline bases. With this approximation, the problem 
of component selection becomes that of selecting the groups of co- 
efficients in the expansion. We apply the adaptive group Lasso to 
select nonzero components, using the group Lasso to obtain an ini- 
tial estimator and reduce the dimension of the problem. We give 
conditions under which the group Lasso selects a model whose num- 
ber of components is comparable with the underlying model, and the 
adaptive group Lasso selects the nonzero components correctly with 
probability approaching one as the sample size increases and achieves 
the optimal rate of convergence. The results of Monte Carlo experi- 
ments show that the adaptive group Lasso procedure works well with 
samples of moderate size. A data example is used to illustrate the 
application of the proposed method. 

1. Introduction. Let $(Y_i, X_i)$, $i = 1, \ldots, n$, be random vectors that are
independently and identically distributed as $(Y, X)$, where $Y$ is a response
variable and $X = (X_1, \ldots, X_p)'$ is a $p$-dimensional covariate vector. Consider
the nonparametric additive model

(1) $Y_i = \mu + \sum_{j=1}^{p} f_j(X_{ij}) + \varepsilon_i$,

where $\mu$ is an intercept term, $X_{ij}$ is the $j$th component of $X_i$, the $f_j$'s are
unknown functions, and $\varepsilon_i$ is an unobserved random variable with mean zero
and finite variance $\sigma^2$. Suppose that some of the additive components $f_j$
are zero. The problem addressed in this paper is to distinguish the nonzero 
components from the zero components and estimate the nonzero compo- 
nents. We allow the possibility that p is larger than the sample size n, which 
we represent by letting p increase as n increases. We propose a penalized 
method for variable selection in (1) and show that the proposed method can 
correctly select the nonzero components with high probability. 

There has been much work on penalized methods for variable selection 
and estimation with high-dimensional data. Methods that have been pro- 
posed include the bridge estimator [Frank and Friedman (1993), Huang, 
Horowitz and Ma (2008)]; least absolute shrinkage and selection opera- 
tor or Lasso [Tibshirani (1996)], the smoothly clipped absolute deviation 
(SCAD) penalty [Fan and Li (2001), Fan and Peng (2004)], and the min- 
imum concave penalty [Zhang (2010)]. Much progress has been made in 
understanding the statistical properties of these methods. In particular, 
many authors have studied the variable selection, estimation and prediction
properties of the Lasso in high-dimensional settings. See, for example,
Meinshausen and Bühlmann (2006), Zhao and Yu (2006), Zou (2006),
Bunea, Tsybakov and Wegkamp (2007), Meinshausen and Yu (2009), Huang, 
Ma and Zhang (2008), van de Geer (2008) and Zhang and Huang (2008), 
among others. All these authors assume a linear or other parametric model. 
In many applications, however, there is little a priori justification for as- 
suming that the effects of covariates take a linear form or belong to any 
other known, finite-dimensional parametric family. For example, in studies 
of economic development, the effects of covariates on the growth of gross 
domestic product can be nonlinear. Similarly, there is evidence of nonlin- 
earity in the gene expression data used in the empirical example in Sec- 
tion 5. 

There is a large body of literature on estimation in nonparametric addi- 
tive models. For example, Stone (1985, 1986) showed that additive spline 
estimators achieve the same optimal rate of convergence for a general fixed 
p as for p= 1. Horowitz and Mammen (2004) and Horowitz, Klemela and 
Mammen (2006) showed that if p is fixed and mild regularity conditions 
hold, then oracle-efficient estimates of the fj's can be obtained by a two-step
procedure. Here, oracle efficiency means that the estimator of each fj
has the same asymptotic distribution that it would have if all the other 
fj's were known. However, these papers do not discuss variable selection in 
nonparametric additive models. 

Antoniadis and Fan (2001) proposed a group SCAD approach for regular- 
ization in wavelets approximation. Zhang et al. (2004) and Lin and Zhang 
(2006) have investigated the use of penalization methods in smoothing spline
ANOVA with a fixed number of covariates. Zhang et al. (2004) used a Lasso-
type penalty but did not investigate model-selection consistency. Lin and 
Zhang (2006) proposed the component selection and smoothing operator 
(COSSO) method for model selection and estimation in multivariate non- 
parametric regression models. For fixed p, they showed that the COSSO 
estimator in the additive model converges at the rate $n^{-d/(2d+1)}$, where $d$
is the order of smoothness of the components. They also showed that, in 
the special case of a tensor product design, the COSSO correctly selects the 
nonzero additive components with high probability. Zhang and Lin (2006) 
considered the COSSO for nonparametric regression in exponential families. 

Meier, van de Geer and Bühlmann (2009) treat variable selection in a
nonparametric additive model in which the numbers of zero and nonzero
fj's may both be larger than n. They propose a penalized least-squares
estimator for variable selection and estimation. They give conditions under
which, with probability approaching 1, their procedure selects a set of fj's 
containing all the additive components whose distance from zero in a certain 
metric exceeds a specified threshold. However, they do not establish model- 
selection consistency of their procedure. Even asymptotically, the selected 
set may be larger than the set of nonzero fj's. Moreover, they impose a 
compatibility condition that relates the levels and smoothness of the fj's. 
The compatibility condition does not have a straightforward, intuitive inter- 
pretation and, as they point out, cannot be checked empirically. Ravikumar 
et al. (2009) proposed a penalized approach for variable selection in non- 
parametric additive models. In their approach, the penalty is imposed on 
the $\ell_2$ norm of the nonparametric components, as well as the mean value
of the components to ensure identifiability. In their theoretical results, they 
require that the eigenvalues of a "design matrix" be bounded away from zero 
and infinity, where the "design matrix" is formed from the basis functions 
for the nonzero components. It is not clear whether this condition holds in 
general, especially when the number of nonzero components diverges with 
n. Another critical condition required in the results of Ravikumar et al. 
(2009) is similar to the irrepresentable condition of Zhao and Yu (2006). It 
is not clear for what type of basis functions this condition is satisfied. We 
do not require such a condition in our results on selection consistency of the 
adaptive group Lasso. 

Several other recent papers have also considered variable selection in non- 
parametric models. For example, Wang, Chen and Li (2007) and Wang and 
Xia (2008) considered the use of group Lasso and SCAD methods for model 
selection and estimation in varying coefficient models with a fixed number of 
coefficients and covariates. Bach (2007) applies what amounts to the group 
Lasso to a nonparametric additive model with a fixed number of covariates. 
He established model selection consistency under conditions that are consid- 
erably more complicated than the ones we require for a possibly diverging 
number of covariates.

In this paper, we propose to use the adaptive group Lasso for variable
selection in (1) based on a spline approximation to the nonparametric com- 
ponents. With this approximation, each nonparametric component is rep- 
resented by a linear combination of spline basis functions. Consequently, 
the problem of component selection becomes that of selecting the groups 
of coefficients in the linear combinations. It is natural to apply the group 
Lasso method, since it is desirable to take into account the grouping structure in the
approximating model. To achieve model selection consistency, we apply the 
group Lasso iteratively as follows. First, we use the group Lasso to obtain 
an initial estimator and reduce the dimension of the problem. Then we use 
the adaptive group Lasso to select the final set of nonparametric compo- 
nents. The adaptive group Lasso is a simple generalization of the adaptive 
Lasso [Zou (2006)] to the method of the group Lasso [Yuan and Lin (2006)]. 
However, here we apply this approach to nonparametric additive modeling. 

We assume that the number of nonzero fj's is fixed. This enables us to 
achieve model selection consistency under simple assumptions that are easy 
to interpret. We do not have to impose compatibility or irrepresentable con- 
ditions, nor do we need to assume conditions on the eigenvalues of certain 
matrices formed from the spline basis functions. We show that the group 
Lasso selects a model whose number of components is bounded with prob- 
ability approaching one by a constant that is independent of the sample 
size. Then using the group Lasso result as the initial estimator, the adaptive 
group Lasso selects the correct model with probability approaching 1 and 
achieves the optimal rate of convergence for nonparametric estimation of an 
additive model. 

The remainder of the paper is organized as follows. Section 2 describes 
the group Lasso and the adaptive group Lasso for variable selection in non- 
parametric additive models. Section 3 presents the asymptotic properties of 
these methods in "large p, small n" settings. Section 4 presents the results of 
simulation studies to evaluate the finite-sample performance of these meth- 
ods. Section 5 provides an illustrative application, and Section 6 includes 
concluding remarks. Proofs of the results stated in Section 3 are given in 
the Appendix. 

2. Adaptive group Lasso in nonparametric additive models. We describe 
a two-step approach that uses the group Lasso for variable selection based 
on a spline representation of each component in additive models. In the first 
step, we use the standard group Lasso to achieve an initial reduction of the 
dimension in the model and obtain an initial estimator of the nonparametric 
components. In the second step, we use the adaptive group Lasso to achieve 
consistent selection. 

Suppose that each $X_j$ takes values in $[a, b]$, where $a < b$ are finite
numbers. To ensure unique identification of the $f_j$'s, we assume that
$E f_j(X_j) = 0$, $1 \le j \le p$. Let
$a = \xi_0 < \xi_1 < \cdots < \xi_K < \xi_{K+1} = b$ be a partition of $[a, b]$ into
$K$ subintervals $I_{Kt} = [\xi_t, \xi_{t+1})$, $t = 0, \ldots, K-1$, and
$I_{KK} = [\xi_K, \xi_{K+1}]$, where $K = K_n = n^{v}$ with $0 < v < 0.5$ is a
positive integer such that
$\max_{1 \le k \le K+1} |\xi_k - \xi_{k-1}| = O(n^{-v})$. Let $S_n$ be the space of
polynomial splines of degree $l \ge 1$ consisting of functions $s$ satisfying:
(i) the restriction of $s$ to $I_{Kt}$ is a polynomial of degree $l$ for
$1 \le t \le K$; (ii) for $l \ge 2$ and $0 \le l' \le l - 2$, $s$ is $l'$ times
continuously differentiable on $[a, b]$. This definition is phrased after
Stone (1985), which is a descriptive version of Schumaker (1981), page 108,
Definition 4.1.

There exists a normalized B-spline basis $\{\phi_k, 1 \le k \le m_n\}$ for $S_n$,
where $m_n = K_n + l$ [Schumaker (1981)]. Thus, for any $f_{nj} \in S_n$, we can
write

(2) $f_{nj}(x) = \sum_{k=1}^{m_n} \beta_{jk} \phi_k(x)$, $1 \le j \le p$.

Under suitable smoothness assumptions, the $f_j$'s can be well approximated
by functions in $S_n$. Accordingly, the variable selection method described in
this paper is based on the representation (2).
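
As a rough illustration of representation (2), the following sketch evaluates a B-spline basis with the Cox-de Boor recursion and forms a linear combination of the basis functions. It is not taken from the paper: the function names, the clamped knot vector on [0, 1], and the coefficient values are assumptions made for the example. The number of basis functions it returns follows the clamped-knot convention used here, which may be counted differently from $m_n$ in the text.

```python
def bspline_basis(x, knots, degree):
    """Values of all B-spline basis functions of `degree` at x (Cox-de Boor)."""
    m = len(knots) - 1
    # Degree 0: indicators of the half-open knot intervals.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0 for i in range(m)]
    if x == knots[-1]:
        # Close the last nonempty interval on the right.
        last = max(i for i in range(m) if knots[i] < knots[i + 1])
        basis[last] = 1.0
    for d in range(1, degree + 1):
        nxt = []
        for i in range(m - d):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + d + 1] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            nxt.append(left + right)
        basis = nxt
    return basis


def clamped_knots(a, b, n_interior, degree):
    """Equally spaced interior knots on [a, b], boundary knots repeated."""
    step = (b - a) / (n_interior + 1)
    interior = [a + step * (t + 1) for t in range(n_interior)]
    return [a] * (degree + 1) + interior + [b] * (degree + 1)


def spline_value(x, coef, knots, degree):
    """f_nj(x) = sum_k beta_jk * phi_k(x), as in representation (2)."""
    return sum(c * v for c, v in zip(coef, bspline_basis(x, knots, degree)))


knots = clamped_knots(0.0, 1.0, n_interior=3, degree=3)
n_basis = len(knots) - 3 - 1               # number of cubic basis functions
coef = [0.5 * k for k in range(n_basis)]   # hypothetical coefficients beta_jk
values = [spline_value(x / 10, coef, knots, 3) for x in range(11)]
```

The basis functions are nonnegative and sum to one at every point of [0, 1], which is the normalization referred to in the text.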

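Let $\|a\|_2 = (\sum_{j=1}^{m} |a_j|^2)^{1/2}$ denote the $\ell_2$ norm of any
vector $a \in \mathbb{R}^m$. Let $\beta_{nj} = (\beta_{j1}, \ldots, \beta_{jm_n})'$ and
$\beta_n = (\beta_{n1}', \ldots, \beta_{np}')'$. Let $w_n = (w_{n1}, \ldots, w_{np})'$
be a given vector of weights, where $0 \le w_{nj} \le \infty$, $1 \le j \le p$.
Consider the penalized least squares criterion

(3) $L_n(\mu, \beta_n) = \sum_{i=1}^{n} \Big[ Y_i - \mu - \sum_{j=1}^{p} \sum_{k=1}^{m_n} \beta_{jk} \phi_k(X_{ij}) \Big]^2 + \lambda_n \sum_{j=1}^{p} w_{nj} \|\beta_{nj}\|_2$,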

where $\lambda_n$ is a penalty parameter. We study the estimators that minimize
$L_n(\mu, \beta_n)$ subject to the constraints

(4) $\sum_{i=1}^{n} \sum_{k=1}^{m_n} \beta_{jk} \phi_k(X_{ij}) = 0$, $1 \le j \le p$.

These centering constraints are sample analogs of the identifying restriction
$E f_j(X_j) = 0$, $1 \le j \le p$. We can convert (3) and (4) to an unconstrained
optimization problem by centering the response and the basis functions. Let

(5) $\bar\phi_{jk} = \frac{1}{n} \sum_{i=1}^{n} \phi_k(X_{ij})$, $\psi_{jk}(x) = \phi_k(x) - \bar\phi_{jk}$.

For simplicity and without causing confusion, we simply write
$\psi_k(x) = \psi_{jk}(x)$. Define

$Z_{ij} = (\psi_1(X_{ij}), \ldots, \psi_{m_n}(X_{ij}))'$.

J. HUANG, J. L. HOROWITZ AND F. WEI

So, $Z_{ij}$ consists of values of the (centered) basis functions at the $i$th
observation of the $j$th covariate. Let $Z_j = (Z_{1j}, \ldots, Z_{nj})'$ be the
$n \times m_n$ "design" matrix corresponding to the $j$th covariate. The total
"design" matrix is $Z = (Z_1, \ldots, Z_p)$. Let
$Y = (Y_1 - \bar Y, \ldots, Y_n - \bar Y)'$. With this notation, we can write

(6) $L_n(\beta_n; \lambda_n) = \|Y - Z\beta_n\|_2^2 + \lambda_n \sum_{j=1}^{p} w_{nj} \|\beta_{nj}\|_2$.

Here, we have dropped $\mu$ in the argument of $L_n$. With the centering,
$\hat\mu = \bar Y$. Then minimizing (3) subject to (4) is equivalent to minimizing
(6) with respect to $\beta_n$, but the centering constraints are not needed for (6).
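
The effect of the centering in (5) can be checked numerically. The following sketch is only an illustration under stated assumptions: the monomials x, x^2 and x^3 stand in for the B-spline functions $\phi_k$, and the covariate values and coefficients are made up. It shows that each column of the centered design matrix sums to zero, so the sample constraint (4) holds for any coefficient vector.

```python
def centered_design(x_col, basis_funcs):
    """Rows Z_ij = (psi_1(X_ij), ..., psi_m(X_ij)) with psi_k = phi_k - mean."""
    n = len(x_col)
    m = len(basis_funcs)
    raw = [[phi(x) for phi in basis_funcs] for x in x_col]
    means = [sum(row[k] for row in raw) / n for k in range(m)]
    return [[row[k] - means[k] for k in range(m)] for row in raw]


# Hypothetical stand-ins for the basis functions phi_k.
basis_funcs = [lambda x: x, lambda x: x ** 2, lambda x: x ** 3]
x_col = [0.1, 0.4, 0.5, 0.8, 0.9]       # made-up values of one covariate
Z_j = centered_design(x_col, basis_funcs)

beta_j = [1.5, -2.0, 0.7]               # arbitrary coefficient vector
fitted = [sum(b * z for b, z in zip(beta_j, row)) for row in Z_j]
column_sums = [sum(row[k] for row in Z_j) for k in range(3)]
constraint = sum(fitted)                # sample analog of E f_j(X_j) = 0
```

Because the constraint is satisfied automatically after centering, the penalized criterion can be minimized without side conditions.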

We now describe the two-step approach to component selection in the 
nonparametric additive model (1). 

Step 1. Compute the group Lasso estimator. Let

$L_{n1}(\beta_n; \lambda_{n1}) = \|Y - Z\beta_n\|_2^2 + \lambda_{n1} \sum_{j=1}^{p} \|\beta_{nj}\|_2$.

This objective function is the special case of (6) that is obtained by
setting $w_{nj} = 1$, $1 \le j \le p$. The group Lasso estimator is
$\tilde\beta_n = \tilde\beta_n(\lambda_{n1}) = \arg\min_{\beta_n} L_{n1}(\beta_n; \lambda_{n1})$.

Step 2. Use the group Lasso estimator $\tilde\beta_n$ to obtain the weights by
setting

$w_{nj} = \begin{cases} \|\tilde\beta_{nj}\|_2^{-1}, & \text{if } \|\tilde\beta_{nj}\|_2 > 0, \\ \infty, & \text{if } \|\tilde\beta_{nj}\|_2 = 0. \end{cases}$

The adaptive group Lasso objective function is

$L_{n2}(\beta_n; \lambda_{n2}) = \|Y - Z\beta_n\|_2^2 + \lambda_{n2} \sum_{j=1}^{p} w_{nj} \|\beta_{nj}\|_2$.

Here, we define $0 \cdot \infty = 0$. Thus, the components not selected by the group
Lasso are not included in Step 2. The adaptive group Lasso estimator is
$\hat\beta_n = \hat\beta_n(\lambda_{n2}) = \arg\min_{\beta_n} L_{n2}(\beta_n; \lambda_{n2})$.
Finally, the adaptive group Lasso estimators of $\mu$ and $f_j$ are

$\hat\mu_n = \bar Y = n^{-1} \sum_{i=1}^{n} Y_i$, $\hat f_{nj}(x) = \sum_{k=1}^{m_n} \hat\beta_{jk} \psi_k(x)$, $1 \le j \le p$.
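
To make the two steps concrete, the sketch below works through them in the special case of an orthonormal design, $Z'Z = I$. In that case the minimizer of $\|Y - Z\beta\|_2^2 + \lambda \sum_j w_j \|\beta_j\|_2$ has the closed form $\beta_j = (1 - \lambda w_j/(2\|z_j\|_2))_+ z_j$ with $z_j = Z_j'Y$. The orthonormality assumption is made only to keep the example short and is not part of the setting studied here; the function names and all numbers are hypothetical.

```python
import math


def group_norm(v):
    """Euclidean norm of one group of coefficients."""
    return math.sqrt(sum(x * x for x in v))


def group_threshold(z, lam, weight):
    """Minimizer of ||b - z||^2 + lam * weight * ||b|| (orthonormal design)."""
    norm = group_norm(z)
    if math.isinf(weight) or norm == 0.0:
        return [0.0] * len(z)           # an infinite weight drops the group
    scale = max(0.0, 1.0 - lam * weight / (2.0 * norm))
    return [scale * x for x in z]


def adaptive_weights(beta_tilde):
    """Step 2 weights: 1/||beta_j|| if the group was selected, else inf."""
    weights = []
    for b in beta_tilde:
        norm = group_norm(b)
        weights.append(1.0 / norm if norm > 0.0 else math.inf)
    return weights


# Hypothetical values of z_j = Z_j'Y for three groups of two coefficients.
z = [[3.0, 4.0], [0.9, 1.2], [0.3, 0.4]]

# Step 1: group Lasso (all weights equal to 1).
lam1 = 2.0
beta_tilde = [group_threshold(zj, lam1, 1.0) for zj in z]

# Step 2: adaptive group Lasso with weights computed from Step 1.
lam2 = 2.0
w = adaptive_weights(beta_tilde)
beta_hat = [group_threshold(zj, lam2, wj) for zj, wj in zip(z, w)]
selected = [j for j, b in enumerate(beta_hat) if group_norm(b) > 0.0]
```

In this toy example the group Lasso keeps the first two groups, while the adaptive step penalizes the weakly estimated second group more heavily and retains only the first. Treating an infinite weight explicitly avoids evaluating `0 * inf`, which is not a number in floating-point arithmetic.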

3. Main results. This section presents our results on the asymptotic 
properties of the estimators defined in Steps 1 and 2 of Section 2. 

Let $k$ be a nonnegative integer, and let $\alpha \in (0, 1]$ be such that
$d = k + \alpha > 0.5$. Let $\mathcal{F}$ be the class of functions $f$ on $[0, 1]$
whose $k$th derivative $f^{(k)}$ exists and satisfies a Lipschitz condition of
order $\alpha$:

$|f^{(k)}(s) - f^{(k)}(t)| \le C|s - t|^{\alpha}$ for $s, t \in [a, b]$.

In (1), without loss of generality, suppose that the first $q$ components
are nonzero, that is, $f_j(x) \ne 0$, $1 \le j \le q$, but $f_j(x) = 0$,
$q + 1 \le j \le p$. Let $A_1 = \{1, \ldots, q\}$ and $A_0 = \{q + 1, \ldots, p\}$.
Define $\|f\|_2 = [\int_a^b f^2(x)\,dx]^{1/2}$ for any function $f$, whenever the
integral exists.
We make the following assumptions. 

(A1) The number of nonzero components $q$ is fixed and there is a constant
$c_f > 0$ such that $\min_{1 \le j \le q} \|f_j\|_2 \ge c_f$.

(A2) The random variables $\varepsilon_1, \ldots, \varepsilon_n$ are independent and
identically distributed with $E\varepsilon_i = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$.
Furthermore, their tail probabilities satisfy
$P(|\varepsilon_i| > x) \le K \exp(-Cx^2)$, $i = 1, \ldots, n$, for all $x \ge 0$ and
for constants $C$ and $K$.

(A3) $E f_j(X_j) = 0$ and $f_j \in \mathcal{F}$, $j = 1, \ldots, q$.

(A4) The covariate vector $X$ has a continuous density and there exist
constants $C_1$ and $C_2$ such that the density function $g_j$ of $X_j$ satisfies
$0 < C_1 \le g_j(x) \le C_2 < \infty$ on $[a, b]$ for every $1 \le j \le p$.

We note that (A1), (A3) and (A4) are standard conditions for nonparametric
additive models. They would be needed to estimate the nonzero additive
components at the optimal $\ell_2$ rate of convergence on $[a, b]$, even if $q$ were
fixed and known. Only (A2) strengthens the assumptions needed for
nonparametric estimation of a nonparametric additive model. While condition
(A1) is reasonable in most applications, it would be interesting to relax this
condition and investigate the case when the number of nonzero components
can also increase with the sample size. The only technical reason that we
assume this condition is related to Lemma 3 given in the Appendix, which
is concerned with the properties of the smallest and largest eigenvalues of
the "design matrix" formed from the spline basis functions. If this lemma
can be extended to the case of a divergent number of components, then (A1)
can be relaxed. However, it is clear that there needs to be a restriction on the
number of nonzero components to ensure model identification.

3.1. Estimation consistency of the group Lasso. In this section, we consider
the selection and estimation properties of the group Lasso estimator.
Define $\tilde A_1 = \{j : \|\tilde\beta_{nj}\|_2 \ne 0, 1 \le j \le p\}$. Let $|A|$ denote
the cardinality of any set $A \subseteq \{1, \ldots, p\}$.

Theorem 1. Suppose that (A1) to (A4) hold and
$\lambda_{n1} \ge C\sqrt{n \log(p m_n)}$ for a sufficiently large constant $C$.

(i) With probability converging to 1, $|\tilde A_1| \le M_1 |A_1| = M_1 q$ for a
finite constant $M_1 > 1$.

(ii) If $\log(p m_n)/n \to 0$ and $(\lambda_{n1}^2 m_n)/n^2 \to 0$ as $n \to \infty$, then
all the nonzero $\beta_{nj}$, $1 \le j \le q$, are selected with probability
converging to one.

(iii) $\sum_{j=1}^{p} \|\tilde\beta_{nj} - \beta_{nj}\|_2^2 = O_p\big(m_n^2 \log(p m_n)/n\big) + O_p(m_n/n) + O\big(1/m_n^{2d-1}\big) + O\big(4 m_n^2 \lambda_{n1}^2/n^2\big)$.

Part (i) of Theorem 1 says that, with probability approaching 1, the group 
Lasso selects a model whose dimension is a constant multiple of the number 
of nonzero additive components fj, regardless of the number of additive 
components that are zero. Part (ii) implies that every nonzero coefficient 
will be selected with high probability. Part (iii) shows that the difference 
between the coefficients in the spline representation of the nonparametric 
functions in (1) and their estimators converges to zero in probability. The 
rate of convergence is determined by four terms: the stochastic error in 
estimating the nonparametric components (the first term) and the intercept 
$\mu$ (the second term), the spline approximation error (the third term) and
the bias due to penalization (the fourth term). 

Let $\tilde f_{nj}(x) = \sum_{k=1}^{m_n} \tilde\beta_{jk} \psi_k(x)$, $1 \le j \le p$. The
following theorem is a consequence of Theorem 1.

Theorem 2. Suppose that (A1) to (A4) hold and that
$\lambda_{n1} \ge C\sqrt{n \log(p m_n)}$ for a sufficiently large constant $C$. Then:

(i) Let $\tilde A_f = \{j : \|\tilde f_{nj}\|_2 > 0, 1 \le j \le p\}$. There is a constant
$M_1 > 1$ such that, with probability converging to 1, $|\tilde A_f| \le M_1 q$.

(ii) If $(m_n \log(p m_n))/n \to 0$ and $(\lambda_{n1}^2 m_n)/n^2 \to 0$ as
$n \to \infty$, then all the nonzero additive components $f_j$, $1 \le j \le q$,
are selected with probability
converging to one.

(iii) $\|\tilde f_{nj} - f_j\|_2^2 = O_p\big(m_n \log(p m_n)/n\big) + O_p(1/n) + O\big(1/m_n^{2d}\big) + O\big(4 m_n \lambda_{n1}^2/n^2\big)$, $j \in \tilde A_2$,

where $\tilde A_2 = A_1 \cup \tilde A_1$.

Thus, under the conditions of Theorem 2, the group Lasso selects all
the nonzero additive components with high probability. Part (iii) of the
theorem gives the rate of convergence of the group Lasso estimator of the
nonparametric components.

For any two sequences $\{a_n, b_n, n = 1, 2, \ldots\}$, we write $a_n \asymp b_n$ if
there are constants $0 < c_1 < c_2 < \infty$ such that $c_1 \le a_n/b_n \le c_2$ for
all $n$ sufficiently large.

We now state a useful corollary of Theorem 2.

Corollary 1. Suppose that (A1) to (A4) hold. If
$\lambda_{n1} \asymp \sqrt{n \log(p m_n)}$ and $m_n \asymp n^{1/(2d+1)}$, then:

(i) If $n^{-2d/(2d+1)} \log(p) \to 0$ as $n \to \infty$, then with probability
converging to one, all the nonzero components $f_j$, $1 \le j \le q$, are selected
and the number of selected components is no more than $M_1 q$.

(ii) $\|\tilde f_{nj} - f_j\|_2^2 = O_p\big(n^{-2d/(2d+1)} \log(p m_n)\big)$, $j \in \tilde A_2$.
For the $\lambda_{n1}$ and $m_n$ given in Corollary 1, the number of zero components
can be as large as $\exp(o(n^{2d/(2d+1)}))$. For example, if each $f_j$ has a continuous
second derivative ($d = 2$), then it is $\exp(o(n^{4/5}))$, which can be much larger
than $n$.



3.2. Selection consistency of the adaptive group Lasso. We now consider 
the properties of the adaptive group Lasso. We first state a general result 
concerning the selection consistency of the adaptive group Lasso, assuming 
an initial consistent estimator is available. We then apply it to the case when
the group Lasso is used as the initial estimator. We make the following 
assumptions. 

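(B1) The initial estimators $\tilde\beta_{nj}$ are $r_n$-consistent at zero:

$r_n \max_{j \in A_0} \|\tilde\beta_{nj}\|_2 = O_P(1)$, $r_n \to \infty$,

and there exists a constant $c_b > 0$ such that

$P\big(\min_{j \in A_1} \|\tilde\beta_{nj}\|_2 \ge c_b b_{n1}\big) \to 1$,

where $b_{n1} = \min_{j \in A_1} \|\beta_{nj}\|_2$.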

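(B2) Let $q$ be the number of nonzero components and $s_n = p - q$ be the
number of zero components. Suppose that:

(a) $\dfrac{m_n}{n^{1/2}} + \dfrac{\lambda_{n2} m_n^{1/4}}{n} = o(1)$,

(b) $\dfrac{n^{1/2} \log^{1/2}(s_n m_n)}{\lambda_{n2} r_n} + \dfrac{n}{\lambda_{n2} r_n m_n^{(2d+1)/2}} = o(1)$.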

We state condition (B1) for a general initial estimator, to highlight the
point that the availability of an $r_n$-consistent estimator at zero is crucial
for the adaptive group Lasso to be selection consistent. In other words, any
initial estimator satisfying (B1) will ensure that the adaptive group Lasso
(based on this initial estimator) is selection consistent, provided that certain
regularity conditions are satisfied. We note that it follows immediately from
Theorem 1 that the group Lasso estimator satisfies (B1). We will come back
to this point below.

For $\hat\beta_n = (\hat\beta_{n1}', \ldots, \hat\beta_{np}')'$ and
$\beta_n = (\beta_{n1}', \ldots, \beta_{np}')'$, we say $\hat\beta_n =_0 \beta_n$ if
$\mathrm{sgn}_0(\|\hat\beta_{nj}\|) = \mathrm{sgn}_0(\|\beta_{nj}\|)$, $1 \le j \le p$, where
$\mathrm{sgn}_0(|x|) = 1$ if $|x| > 0$ and $= 0$ if $|x| = 0$.

Theorem 3. Suppose that conditions (B1), (B2) and (A1)-(A4) hold.
Then:

(i) $P(\hat\beta_n =_0 \beta_n) \to 1$.

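(ii) $\sum_{j=1}^{q} \|\hat\beta_{nj} - \beta_{nj}\|_2^2 = O_p\big(m_n^2/n\big) + O_p(m_n/n) + O\big(1/m_n^{2d-1}\big) + O\big(4 m_n^2 \lambda_{n2}^2/n^2\big)$.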



This theorem is concerned with the selection and estimation properties of 
the adaptive group Lasso in terms of $\hat\beta_n$. The following theorem states the
results in terms of the estimators of the nonparametric components. 

Theorem 4. Suppose that conditions (B1), (B2) and (A1)-(A4) hold.
Then:

(i) $P\big(\|\hat f_{nj}\|_2 > 0, j \in A_1 \text{ and } \|\hat f_{nj}\|_2 = 0, j \in A_0\big) \to 1$.



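(ii) $\sum_{j=1}^{q} \|\hat f_{nj} - f_j\|_2^2 = O_p(m_n/n) + O_p(1/n) + O\big(1/m_n^{2d}\big) + O\big(4 m_n \lambda_{n2}^2/n^2\big)$.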



Part (i) of this theorem states that the adaptive group Lasso can consis- 
tently distinguish nonzero components from zero components. Part (ii) gives 
an upper bound on the rate of convergence of the estimator. 

We now apply the above results to our proposed procedure described in
Section 2, in which we first obtain the group Lasso estimator and then
use it as the initial estimator in the adaptive group Lasso.

By Theorem 1, if $\lambda_{n1} \asymp \sqrt{n \log(p m_n)}$ and $m_n \asymp n^{1/(2d+1)}$ for
$d > 1$, then the group Lasso estimator satisfies (B1) with
$r_n \asymp n^{d/(2d+1)}/\sqrt{\log(p m_n)}$. In this case, (B2) simplifies to

(7) $\dfrac{\lambda_{n2}}{n^{(8d+3)/(8d+4)}} = o(1)$ and $\dfrac{n^{1/(4d+2)} \log^{1/2}(p m_n)}{\lambda_{n2}} = o(1)$.

We summarize the above discussion in the following corollary.

Corollary 2. Let the group Lasso estimator $\tilde\beta_n = \tilde\beta_n(\lambda_{n1})$ with
$\lambda_{n1} \asymp \sqrt{n \log(p m_n)}$ and $m_n \asymp n^{1/(2d+1)}$ be the initial
estimator in the adaptive group Lasso. Suppose that the conditions of
Theorem 1 hold. If $\lambda_{n2} \le O(n^{1/2})$ and satisfies (7), then the adaptive
group Lasso consistently selects the nonzero components in (1), that is,
part (i) of Theorem 4 holds. In addition,

$\sum_{j=1}^{q} \|\hat f_{nj} - f_j\|_2^2 = O_p\big(n^{-2d/(2d+1)}\big)$.



This corollary follows directly from Theorems 1 and 4. The largest $\lambda_{n2}$
allowed is $\lambda_{n2} = O(n^{1/2})$. With this $\lambda_{n2}$, the first equation in (7) is
satisfied. Substituting it into the second equation in (7), we obtain
$p = \exp(o(n^{2d/(2d+1)}))$, which is the largest $p$ permitted and can be larger
than $n$. Thus, under the conditions of this corollary, our proposed adaptive
group Lasso estimator using the group Lasso as the initial estimator is
selection consistent and achieves the optimal rate of convergence even when
$p$ is larger than $n$. Following model selection, oracle-efficient,
asymptotically normal estimators of the
nonzero components can be obtained by using existing methods.

4. Simulation studies. We use simulation to evaluate the performance of 
the adaptive group Lasso with regard to variable selection. The generating 
model is

(8) $y_i = f(x_i) + \varepsilon_i = \sum_{j=1}^{p} f_j(x_{ij}) + \varepsilon_i$, $i = 1, \ldots, n$.

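Since $p$ can be larger than $n$, we consider two ways to select the penalty
parameter, the BIC [Schwarz (1978)] and the EBIC [Chen and Chen (2008,
2009)]. The BIC is defined as

$\mathrm{BIC}(\lambda) = \log(\mathrm{RSS}_\lambda) + \mathrm{df}_\lambda \cdot \dfrac{\log n}{n}$.

Here, $\mathrm{RSS}_\lambda$ is the residual sum of squares for a given $\lambda$, and the
degrees of freedom $\mathrm{df}_\lambda = \hat q_\lambda m_n$, where $\hat q_\lambda$ is the number of
nonzero estimated components for the given $\lambda$. The EBIC is defined as

$\mathrm{EBIC}(\lambda) = \log(\mathrm{RSS}_\lambda) + \mathrm{df}_\lambda \cdot \dfrac{\log n}{n} + \nu \cdot \mathrm{df}_\lambda \cdot \dfrac{\log p}{n}$,

where $0 \le \nu \le 1$ is a constant. We use $\nu = 0.5$.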

We have also considered two other possible ways of defining df: (a) using 
the trace of a linear smoother based on a quadratic approximation; (b) using 
the number of estimated nonzero components. We have decided to use the 
definition given above based on the results from our simulations. We note
that the df for the group Lasso of Yuan and Lin (2006) requires an initial
(least squares) estimator, which is not available when p> n. Thus, their df 
is not applicable to our problem. 
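
The two criteria can be computed along a grid of penalty values as in the following sketch. The functions follow the definitions of BIC and EBIC with df equal to the number of selected components times $m_n$; the solution path, the value $m_n = 6$ and the function names are hypothetical and serve only to show how the criteria are compared.

```python
import math


def bic(rss, n_selected, m_n, n):
    """BIC(lambda) = log(RSS) + df * log(n) / n with df = n_selected * m_n."""
    df = n_selected * m_n
    return math.log(rss) + df * math.log(n) / n


def ebic(rss, n_selected, m_n, n, p, nu=0.5):
    """EBIC adds the extra term nu * df * log(p) / n to the BIC."""
    df = n_selected * m_n
    return bic(rss, n_selected, m_n, n) + nu * df * math.log(p) / n


# Hypothetical path: (lambda, RSS, number of selected components).
path = [(0.5, 80.0, 12), (1.0, 95.0, 5), (2.0, 110.0, 4), (4.0, 400.0, 1)]
n, p, m_n = 100, 1000, 6

best_bic = min(path, key=lambda r: bic(r[1], r[2], m_n, n))
best_ebic = min(path, key=lambda r: ebic(r[1], r[2], m_n, n, p))
```

Since the additional EBIC term grows with both df and log p, the EBIC never selects a larger model than the BIC on the same path. In this made-up example it picks a sparser one.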

In our simulation example, we compare the adaptive group Lasso with 
the group Lasso and ordinary Lasso. Here, the ordinary Lasso estimator is 
defined as the value that minimizes 

$\|Y - Z\beta_n\|_2^2 + \lambda_n \sum_{j=1}^{p} \sum_{k=1}^{m_n} |\beta_{jk}|$.

This simple application of the Lasso does not take into account the grouping 
structure in the spline expansions of the components. The group Lasso and 
the adaptive group Lasso estimates are computed using the algorithm pro- 
posed by Yuan and Lin (2006). The ordinary Lasso estimates are computed 
using the Lars algorithms [Efron et al. (2004)]. The group Lasso is used as 
the initial estimate for the adaptive group Lasso. 

We also compare the results from the nonparametric additive modeling 
with those from the standard linear regression model with Lasso. We note 
that this is not a fair comparison because the generating model is highly 
nonlinear. Our purpose is to illustrate that it is necessary to use nonpara- 
metric models when the underlying model deviates substantially from linear 
models in the context of variable selection with high-dimensional data and 
that model misspecification can lead to bad selection results. 

Example 1. We generate data from the model

$y_i = f(x_i) + \varepsilon_i = \sum_{j=1}^{p} f_j(x_{ij}) + \varepsilon_i$, $i = 1, \ldots, n$,

where $f_1(t) = 5t$, $f_2(t) = 3(2t - 1)^2$, $f_3(t) = 4\sin(2\pi t)/(2 - \sin(2\pi t))$,
$f_4(t) = 6(0.1\sin(2\pi t) + 0.2\cos(2\pi t) + 0.3\sin(2\pi t)^2 + 0.4\cos(2\pi t)^3 + 0.5\sin(2\pi t)^3)$,
and $f_5(t) = \cdots = f_p(t) = 0$. Thus, the number of nonzero functions is $q = 4$.
This generating model is the same as Example 1 of Lin and Zhang (2006).
However, here we use this model in high-dimensional settings. We consider
the cases where $p = 1000$ and three different sample sizes: $n = 50$, 100 and
200. We use the cubic B-spline with six evenly distributed knots for all the
functions $f_k$. The number of replications in all the simulations is 400.

The covariates are simulated as follows. First, we generate
$w_{i1}, \ldots, w_{ip}, u_i, u_i', v_i$ independently from $N(0, 1)$ truncated to the
interval $[0, 1]$, $i = 1, \ldots, n$. Then we set $x_{ik} = (w_{ik} + t u_i)/(1 + t)$ for
$k = 1, \ldots, 4$ and $x_{ik} = (w_{ik} + t v_i)/(1 + t)$ for $k = 5, \ldots, p$, where the
parameter $t$ controls the amount of correlation among predictors. We have
$\mathrm{Corr}(x_{ik}, x_{ij}) = t^2/(1 + t^2)$, $1 \le j \le 4$, $1 \le k \le 4$, and
$\mathrm{Corr}(x_{ik}, x_{ij}) = t^2/(1 + t^2)$, $4 < j \le p$, $4 < k \le p$, but the
covariates of the nonzero components and zero components are independent.
We consider $t = 0, 1$ in our simulation. The signal-to-noise ratio is defined
to be $\mathrm{sd}(f)/\mathrm{sd}(\varepsilon)$. The error term is chosen to be
$\varepsilon_i \sim N(0, 1.27^2)$ to give a signal-to-noise ratio (SNR) of 3.11 : 1. This
value is the same as the estimated SNR in the real data example below, which
is the square root of the ratio of the sum of squared estimated components to
the sum of squared residuals.
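
The data-generating design of Example 1 can be sketched as follows. This is an illustration rather than the code used for the reported simulations: the truncated normal draws are obtained by simple rejection sampling, the function names and the seed are assumptions, and the variable $u_i'$ listed above is omitted because it does not enter the formulas for $x_{ik}$ as printed.

```python
import math
import random


def truncated_normal(rng, low=0.0, high=1.0):
    """Draw from N(0, 1) truncated to [low, high] by rejection sampling."""
    while True:
        z = rng.gauss(0.0, 1.0)
        if low <= z <= high:
            return z


def f1(t):
    return 5.0 * t


def f2(t):
    return 3.0 * (2.0 * t - 1.0) ** 2


def f3(t):
    s = math.sin(2.0 * math.pi * t)
    return 4.0 * s / (2.0 - s)


def f4(t):
    s = math.sin(2.0 * math.pi * t)
    c = math.cos(2.0 * math.pi * t)
    return 6.0 * (0.1 * s + 0.2 * c + 0.3 * s ** 2 + 0.4 * c ** 3 + 0.5 * s ** 3)


def simulate(n, p, t, sigma=1.27, seed=0):
    """Return (X, y): x_ik = (w_ik + t*u_i)/(1+t) for the first four
    covariates and (w_ik + t*v_i)/(1+t) for the remaining ones."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        w = [truncated_normal(rng) for _ in range(p)]
        u, v = truncated_normal(rng), truncated_normal(rng)
        row = [(w[k] + t * (u if k < 4 else v)) / (1.0 + t) for k in range(p)]
        signal = f1(row[0]) + f2(row[1]) + f3(row[2]) + f4(row[3])
        X.append(row)
        y.append(signal + rng.gauss(0.0, sigma))
    return X, y


X, y = simulate(n=50, p=20, t=1.0)   # small p only to keep the example fast
```

Setting `t=0.0` gives independent covariates, and every simulated covariate remains in [0, 1] because it is a convex combination of two values in that interval.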

The results of 400 Monte Carlo replications are summarized in Table
1. The columns are the mean number of variables selected (NV), model
error (ME), the percentage of replications in which all the correct additive
components are included in the selected model (IN), and the percentage of
replications in which precisely the correct components are selected (CS).
The corresponding standard errors are in parentheses. The model error is
computed as the average of $n^{-1} \sum_{i=1}^{n} [\hat f(x_i) - f(x_i)]^2$ over the 400
Monte Carlo replications, where $f$ is the true conditional mean function.
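
The summary measures can be computed from the selected index sets as in the sketch below. The function names and the four hypothetical replications are assumptions made for illustration; the quantities follow the definitions of NV, IN, CS and the model error given above.

```python
def selection_summary(selected_sets, true_set):
    """NV, IN (%) and CS (%) over replications, given selected index sets."""
    reps = len(selected_sets)
    nv = sum(len(s) for s in selected_sets) / reps
    # IN: the true set is contained in the selected set.
    inc = sum(true_set <= s for s in selected_sets) / reps * 100.0
    # CS: the selected set equals the true set exactly.
    cs = sum(s == true_set for s in selected_sets) / reps * 100.0
    return nv, inc, cs


def model_error(f_hat_values, f_values):
    """n^{-1} * sum_i [f_hat(x_i) - f(x_i)]^2 for one replication."""
    n = len(f_values)
    return sum((a - b) ** 2 for a, b in zip(f_hat_values, f_values)) / n


true_set = {0, 1, 2, 3}
# Hypothetical selections from four replications.
selected_sets = [{0, 1, 2, 3}, {0, 1, 2, 3, 7}, {0, 1, 3}, {0, 1, 2, 3}]
nv, inc, cs = selection_summary(selected_sets, true_set)
me = model_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
```

By construction CS can never exceed IN, since selecting exactly the correct components implies that all of them are included.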

Table 1 shows that the adaptive group Lasso selects all the nonzero com- 
ponents (IN) and selects exactly the correct model (CS) more frequently 
than the other methods do. For example, with the BIC and n = 200, the 
percentage of correct selections (CS) by the adaptive group Lasso ranges 
from 65.25% to 81%, which is much higher than the ranges 30-57.75% for 
the group Lasso and 12-15.75% for the ordinary Lasso. The adaptive group 
Lasso and group Lasso perform better than the ordinary Lasso in all of the 
experiments, which illustrates the importance of taking account of the group 
structure of the coefficients of the spline expansion. Correlation among co- 
variates increases the difficulty of component selection, so it is not surprising 
that all methods perform better with independent covariates than with cor- 
related ones. The percentage of correct selections increases as the sample 
size increases. The linear model with Lasso never selects the correct model. 
This illustrates the poor results that can be produced by a linear model 
when the true conditional mean function is nonlinear. 

Table 1 also shows that the model error (ME) of the group Lasso is only 
slightly larger than that of the adaptive group Lasso. The models selected 
by the group Lasso nest those selected by the adaptive group Lasso and so
have more estimated coefficients. Therefore, the group
Lasso estimators of the conditional mean function have a larger variance and 
larger ME. The differences between the MEs of the two methods are small, 
however, because as can be seen from the NV column, the models selected 
by the group Lasso in our experiments have only slightly more estimated 
coefficients than the models selected by the adaptive group Lasso. 

Example 2. We now compare the adaptive group Lasso with the COSSO 
[Lin and Zhang (2006)]. This comparison is suggested to us by the Associate 
Editor. Because the COSSO algorithm only works for the case when p is 
smaller than n, we use the same set-up as in Example 1 of Lin and Zhang 



Table 1 

Example 1. Simulation results for the adaptive group Lasso, group Lasso, ordinary Lasso, and linear model with Lasso, n = 50, 100 or 
200, p= 1000. NV, average number of the variables being selected; ME, model error; IN, percentage of occasions on which the correct 
components are included in the selected model; CS, percentage of occasions on which correct components are selected, averaged over 400 
replications. Enclosed in parentheses are the corresponding standard errors. Top panel, independent predictors; bottom panel, correlated 

predictors 
(2006). In this example, the generating model is as in (8) with 4 nonzero 
components. Let $X_j = (W_j + tU)/(1 + t)$, $j = 1, \ldots, p$, where $W_1, \ldots, W_p$ 
and $U$ are i.i.d. from $N(0,1)$, truncated to the interval $[0,1]$. Therefore, 
$\mathrm{corr}(X_j, X_k) = t^2/(1 + t^2)$ for $j \neq k$. The random error term $\varepsilon \sim N(0, 1.32^2)$. 
The SNR is 3:1. We consider three different sample sizes n = 50, 100 or 200 
and three different numbers of predictors p = 10, 20 or 50. The COSSO 
estimator is computed using the Matlab software which is publicly available at 
http://www4.stat.ncsu.edu/~hzhang/cosso.html. 
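
The covariate design of this example can be sketched in a few lines. The following is a minimal illustration, not the code used for the reported simulations: the function names, the rejection sampler and the seed are assumptions made here, and only NumPy is used.

```python
import numpy as np

def truncated_std_normal(rng, size, low=0.0, high=1.0):
    """Draw N(0,1) variates conditioned on [low, high] by rejection sampling."""
    out = np.empty(size)
    flat = out.reshape(-1)  # view on `out`, filled in place
    filled = 0
    while filled < flat.size:
        draws = rng.standard_normal(2 * (flat.size - filled) + 16)
        keep = draws[(draws >= low) & (draws <= high)]
        take = min(keep.size, flat.size - filled)
        flat[filled:filled + take] = keep[:take]
        filled += take
    return out

def simulate_covariates(n, p, t, seed=0):
    """X_j = (W_j + t U) / (1 + t) with W_1..W_p and U i.i.d. truncated N(0,1)."""
    rng = np.random.default_rng(seed)
    W = truncated_std_normal(rng, (n, p))
    U = truncated_std_normal(rng, (n, 1))  # shared component, broadcast over columns
    return (W + t * U) / (1.0 + t)
```

Because W_j and U have the same variance, the pairwise correlation of the generated columns is t^2/(1 + t^2); t = 0 gives independent predictors and t = 1 gives correlation 0.5.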

The COSSO procedure uses either generalized cross-validation or 5-fold 
cross-validation. Based on the simulation results of Lin and Zhang (2006) and 
our own simulations, the COSSO with 5-fold cross-validation has better 
selection performance. Thus, we compare the adaptive group Lasso with 
BIC or EBIC with the COSSO with 5-fold cross-validation. The results are 
given in Table 2. For independent predictors, when n = 200 and p = 10, 20 
or 50, the adaptive group Lasso and COSSO have similar performance in 
terms of selection accuracy and model error. However, for smaller n and 
larger p, the adaptive group Lasso does significantly better. For example, 
for n = 100 and p = 50, the percentage of correct selection for the adaptive 
group Lasso is 81-83%, but it is only 11% for the COSSO. The model error of 
the adaptive group Lasso is similar to or smaller than that of the COSSO. In 
several experiments, the model error of the COSSO is 2 to more than 7 times 
larger than that of the adaptive group Lasso. It is interesting to note that 
when n = 50 and p = 20 or 50, the adaptive group Lasso still does a decent 
job in selecting the correct model, but the COSSO does poorly in these two 
cases. In particular, for n = 50 and p = 50, the COSSO did not select the 
exactly correct model in any of the simulation runs. For dependent predictors, 
the comparison is even more favorable to the adaptive group Lasso, which 
performs significantly better than COSSO in terms of both model error and 
selection accuracy in all the cases. 
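
The selection summaries reported in the tables (NV, IN and CS) can be computed from the selected index sets of the replications. The sketch below is illustrative only; the function name is hypothetical, and it assumes that CS counts the replications in which exactly the true components are selected.

```python
def selection_summary(selected_sets, true_set):
    """Summarize variable selection over Monte Carlo replications.

    NV: average number of selected variables.
    IN: percentage of replications whose selection includes all true components.
    CS: percentage of replications whose selection equals the true set
        (assumed reading of "correct components are selected").
    """
    true_set = set(true_set)
    n_rep = len(selected_sets)
    nv = sum(len(set(s)) for s in selected_sets) / n_rep
    inc = 100.0 * sum(true_set <= set(s) for s in selected_sets) / n_rep
    cs = 100.0 * sum(set(s) == true_set for s in selected_sets) / n_rep
    return {"NV": nv, "IN": inc, "CS": cs}
```

For instance, with true components {0, 1, 2, 3} and four replications selecting {0,1,2,3}, {0,1,2,3,7}, {0,1,2} and {0,1,2,3}, the summary is NV = 4.0, IN = 75.0 and CS = 50.0.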

5. Data example. We use the data set reported in Scheetz et al. (2006) 
to illustrate the application of the proposed method in high-dimensional 
settings. For this data set, 120 twelve-week-old male rats were selected for tissue 
harvesting from the eyes and for microarray analysis. The microarrays used 
to analyze the RNA from the eyes of these animals contain over 31,042 
different probe sets (Affymetrix GeneChip Rat Genome 230 2.0 Array). The 
intensity values were normalized using the robust multi-chip averaging method 
[Irizarry et al. (2003)] to obtain summary expression values for each 
probe set. Gene expression levels were analyzed on a logarithmic scale. 

We are interested in finding the genes that are related to the gene TRIM32. 
This gene was recently found to cause Bardet-Biedl syndrome [Chiang et al. 
(2006)], which is a genetically heterogeneous disease of multiple organ sys- 
tems including the retina. Although over 30,000 probe sets are represented 



Table 2 

Example 2. Simulation results comparing the adaptive group Lasso and COSSO. n = 50, 100 or 200, p = 10, 20 or 50. NV, average 
number of the variables being selected; ME, model error; IN, percentage of occasions on which all the correct components are included in 
the selected model; CS, percentage of occasions on which correct components are selected, averaged over 400 replications. Enclosed in 
parentheses are the corresponding standard errors 

                             p = 10                              p = 20                              p = 50
                   NV      ME      IN      CS          NV      ME      IN      CS          NV      ME      IN      CS

                   3.70    1.46    75.00   66.00       3.71    1.74    72.00   64.00       3.20    2.98    65.00   60.00
                  (0.72)  (0.78)  (0.41)  (0.46)      (0.68)  (1.06)  (0.42)  (0.48)      (1.42)  (1.96)  (0.46)  (0.50)
COSSO(5CV)         3.98    1.42    41.00   26.00       4.14    1.76    30.00    6.00       4.24    6.88     8.00    0.00
                  (0.64)  (0.74)  (0.49)  (0.42)      (2.27)  (1.11)  (0.46)  (0.24)      (2.96)  (2.91)  (0.27)  (0.00)

AGLasso(BIC)       3.30    2.26    70.00   62.00       3.06    3.02    65.00   60.00       2.87    4.01    52.00   42.00
                  (1.16)  (1.09)  (0.46)  (0.49)      (1.52)  (2.14)  (0.46)  (0.50)      (1.56)  (3.69)  (0.44)  (0.52)
AGLasso(EBIC)      3.32    2.20    70.00   64.00       3.10    3.01    68.00   62.00       2.90    3.88    50.00   42.00
                  (1.14)  (1.06)  (0.46)  (0.48)      (1.51)  (2.12)  (0.45)  (0.49)      (1.54)  (3.62)  (0.42)  (0.52)
COSSO(5CV)         4.14    3.77    25.00    6.00       4.20    6.98     5.00    0.00       4.90    9.93     1.00    0.00
                  (2.25)  (2.02)  (0.44)  (0.24)      (2.88)  (2.82)  (0.22)  (0.00)      (3.30)  (4.08)  (0.10)  (0.00)
on the Rat Genome 230 2.0 Array, many of them are not expressed in the 
eye tissue, and initial screening using correlation shows that most probe sets 
have very low correlation with TRIM32. In addition, we expect only 
a small number of genes to be related to TRIM32. Therefore, we use the 500 
probe sets that are expressed in the eye and have the highest marginal 
correlation with TRIM32 in the analysis. Thus, the sample size is n = 120 
(i.e., there are 120 arrays from 120 rats) and p = 500, which makes this a 
sparse, high-dimensional regression problem. 
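
The correlation screening step can be sketched as follows. This is an illustrative sketch rather than the code used for the analysis: the function name is hypothetical, ranking by absolute Pearson correlation is an assumption, and the preliminary filter for probe sets expressed in the eye is not shown.

```python
import numpy as np

def screen_by_marginal_correlation(X, y, k):
    """Return the indices of the k columns of X with the largest absolute
    marginal (Pearson) correlation with y, together with all correlations."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom = np.where(denom > 0, denom, np.inf)  # constant columns get correlation 0
    corr = (Xc.T @ yc) / denom
    order = np.argsort(-np.abs(corr), kind="stable")
    return order[:k], corr
```

In the setting described above, X would hold the expression values of the candidate probe sets (one column per probe set), y the expression of TRIM32, and k = 500.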

We use the nonparametric additive model to model the relation between 
the expression of TRIM32 and those of the 500 genes. We estimate model 
(1) using the ordinary Lasso, group Lasso, and adaptive group Lasso for the 
nonparametric additive model. To compare the results of the nonparametric 
additive model with that of the linear regression model, we also analyzed 
the data using the linear regression model with Lasso. We scale the 
covariates so that their values are between 0 and 1 and use cubic splines with six 
evenly distributed knots to estimate the additive components. The penalty 
parameters in all the methods are chosen using the BIC or EBIC as in the 
simulation study. Table 3 lists the probes selected by the group Lasso and 
the adaptive group Lasso, indicated by the check signs. Table 4 shows the 
number of variables selected and the residual sum of squares obtained with 
each estimation method. For the ordinary Lasso with the spline expansion, a 
variable is considered to be selected if any of the estimated coefficients of the 
spline approximation to its additive component are nonzero. Depending on 
whether BIC or EBIC is used, the group Lasso selects 16-17 variables, the 
adaptive group Lasso selects 15 variables, the ordinary Lasso with the spline 
expansion selects 94-97 variables, and the linear model with Lasso selects 
8-14 variables. Table 4 shows that the adaptive group Lasso does better than the other 
methods in terms of residual sum of squares (RSS). We have also examined 
the plots (not shown) of the estimated additive components obtained with 
the group Lasso and the adaptive group Lasso, respectively. Most are highly 
nonlinear, confirming the need for taking into account nonlinearity. 
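
Two of the preprocessing and bookkeeping steps described above are simple enough to sketch directly: rescaling each covariate to [0, 1], and declaring a variable selected when any coefficient in its spline group is nonzero. The function names and the coefficient layout (consecutive blocks of equal size, one block per variable) are assumptions made for illustration.

```python
import numpy as np

def scale_to_unit_interval(X):
    """Rescale each column of X linearly so that its values lie in [0, 1]."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # constant columns are mapped to 0
    return (X - lo) / span

def selected_variables(beta, group_size, tol=0.0):
    """Indices of variables having at least one spline coefficient with
    absolute value above tol; beta stacks the groups consecutively."""
    groups = np.asarray(beta, dtype=float).reshape(-1, group_size)
    return np.flatnonzero(np.abs(groups).max(axis=1) > tol)
```

This rule matters for the ordinary Lasso, which penalizes coefficients individually and can leave a single nonzero coefficient in a group; the group Lasso and adaptive group Lasso set whole groups to zero.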

In order to evaluate the performance of the methods, we use cross-validation 
and compare the prediction mean square errors (PEs). We randomly parti- 
tion the data into 6 subsets, each set consisting of 20 observations. We then 
fit the model with 5 subsets as training set and calculate the PE for the 
remaining set which we consider as test set. We repeat this process 6 times, 
considering one of the 6 subsets as test set every time. We compute the 
average of the numbers of probes selected and the prediction errors of these 
6 calculations. Then we replicate this process 400 times (this is suggested to 
us by the Associate Editor). Table 5 gives the average values over 400 repli- 
cations. The adaptive group Lasso has smaller average prediction error than 
the group Lasso, the ordinary Lasso and the linear regression with Lasso. 
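
The repeated cross-validation scheme described above can be sketched generically. The function below is an illustration under stated assumptions: `fit` and `predict` are hypothetical user-supplied callables standing in for any of the four estimation methods (including its penalty parameter selection), and the averaging of the number of selected probes is omitted for brevity.

```python
import numpy as np

def repeated_kfold_pe(X, y, fit, predict, n_folds=6, n_repeats=400, seed=0):
    """Average prediction mean square error over repeated random K-fold splits.

    Returns the mean and standard deviation, across repeats, of the
    fold-averaged test-set mean square error.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    pe_per_repeat = []
    for _ in range(n_repeats):
        folds = np.array_split(rng.permutation(n), n_folds)  # random partition
        fold_pe = []
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            model = fit(X[train_idx], y[train_idx])
            resid = y[test_idx] - predict(model, X[test_idx])
            fold_pe.append(np.mean(resid ** 2))
        pe_per_repeat.append(np.mean(fold_pe))
    return float(np.mean(pe_per_repeat)), float(np.std(pe_per_repeat))
```

With n = 120 and n_folds = 6, each test set has 20 observations, as in the partition described in the text.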
Table 3 

Probe sets selected by the group Lasso and the adaptive group Lasso in the data example 
using BIC or EBIC for penalty parameter selection. GL, group Lasso; AGL, adaptive 
group Lasso; Linear, linear model with Lasso 

Probes         GL(BIC)   AGL(BIC)   Linear(BIC)   GL(EBIC)   AGL(EBIC)   Linear(EBIC)
1389584_at        ✓         ✓           ✓            -           -            -
1383o73_at        ✓         ✓           ✓            ✓           ✓            ✓
1379971_at        ✓         ✓           ✓            -           ✓            ✓
1374106_at        ✓         -           ✓            ✓           -            ✓
1393817_at        ✓         -           ✓            ✓           -            -
1373776_at        ✓         ✓           ✓            -           ✓            -
1377187_at        ✓         ✓           -            ✓           ✓            -
1393955_at        -         ✓           ✓            ✓           ✓            -
1393684_at        -         -           -            ✓           ✓            -
1381515_at        ✓         ✓           -            ✓           ✓            -
1382835_at        ✓         ✓           -            ✓           ✓            -
1385944_at        ✓         ✓           ✓            ✓           ✓            -
1382263_at        ✓         ✓           ✓            ✓           ✓            ✓
1380033_at        ✓         -           -            ✓           ✓            -
1398594_at        -         -           -            -           -            ✓
1376744_at        ✓         ✓           -            ✓           -            -
1382633_at        ✓         ✓           -            ✓           ✓            -
1383110_at        -         -           ✓            -           -            ✓
1386683_at        -         -           ✓            -           -            -

The ordinary Lasso selects far more probe sets than the other approaches, 
but this does not lead to better prediction performance. Therefore, in this 
example, the adaptive group Lasso provides the investigator a more targeted 
list of probe sets, which can serve as a starting point for further study. 

It is of interest to compare the selection results from the adaptive group 
Lasso and the linear regression model with Lasso. The adaptive group Lasso 
and the linear model with Lasso select different sets of genes. When the 



Table 4 

Analysis results for the data example. No. of probes, the number of probe sets selected; 
RSS, the residual sum of squares of the fitted model 

                                          BIC                             EBIC
                                No. of probe sets    RSS        No. of probe sets    RSS
Adaptive group Lasso                   15          1.52e-03            15          1.52e-03
Group Lasso                            17          3.24e-03            16          3.40e-03
Ordinary Lasso                         97          2.96e-07            94          8.10e-08
Linear regression with Lasso           14          2.62e-03             8          3.75e-03
Table 5 

Comparison of adaptive group Lasso, group Lasso, ordinary Lasso, and linear regression 
model with Lasso for the data example. ANP, the average number of probe sets selected 
averaged across 400 replications; PE, the average of prediction mean square errors for 
the test set 

          Adaptive group Lasso       Group Lasso            Ordinary Lasso         Linear model with Lasso
          ANP      PE                ANP      PE            ANP      PE            ANP      PE
BIC       15.75    1.86e-02          16.45    2.89e-02      78.48    1.40e-02      9.25     2.26e-02
          (0.85)   (0.47e-02)        (0.88)   (0.49e-02)    (3.62)   (0.90e-02)    (0.88)   (1.41e-02)
EBIC      15.55    1.78e-02          16.75    1.99e-02      80.00    1.23e-02      9.15     2.03e-02
          (0.82)   (0.42e-02)        (0.84)   (0.47e-02)    (3.50)   (0.89e-02)    (0.86)   (1.39e-02)


penalty parameter is chosen with the BIC, the adaptive group Lasso selects 
5 genes that are not selected by the linear model with Lasso. In addition, the 
linear model with Lasso selects 5 genes that are not selected by the adaptive 
group Lasso. When the penalty parameter is selected with the EBIC, the 
adaptive group Lasso selects 10 genes that are not selected by the linear 
model with Lasso. The estimated effects of many of the genes are nonlinear, 
and the Monte Carlo results of Section 4 show that the performance of the 
linear model with Lasso can be very poor in the presence of nonlinearity. 
Therefore, we interpret the differences between the gene selections of the 
adaptive group Lasso and the linear model with Lasso as evidence that the 
selections produced by the linear model are misleading. 

6. Concluding remarks. In this paper, we propose to use the adap- 
tive group Lasso for variable selection in nonparametric additive models in 
sparse, high-dimensional settings. A key requirement for the adaptive group 
Lasso to be selection consistent is that the initial estimator is estimation 
consistent and selects all the important components with high probability. 
In low-dimensional settings, finding an initial consistent estimator is rela- 
tively easy and can be achieved by many well-established approaches such as 
the additive spline estimators. However, in high-dimensional settings, find- 
ing an initial consistent estimator is difficult. Under the conditions stated 
in Theorem 1, the group Lasso is shown to be consistent and selects all 
the important components. Thus the group Lasso can be used as the initial 
estimator in the adaptive group Lasso to achieve selection consistency. 
Following model selection, oracle-efficient, asymptotically normal estimators of the 
nonzero components can be obtained by using existing methods. Our sim- 
ulation results indicate that our procedure works well for variable selection 
in the models considered. Therefore, the adaptive group Lasso is a useful 
approach for variable selection and estimation in sparse, high-dimensional 
nonparametric additive models. 
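
The two-step procedure summarized here can be sketched numerically. The code below is a minimal sketch, not the implementation used in the paper: it assumes the criterion $\|Y - Z\beta\|_2^2 + \lambda\sum_j w_j\|\beta_j\|_2$ with adaptive weights $w_j = 1/\|\tilde\beta_j\|_2$ (groups with a zero initial estimate are dropped, which corresponds to an infinite weight), it uses a plain proximal gradient solver chosen here for brevity, and all function names are hypothetical. Group labels are assumed to be the integers 0, 1, ..., G-1.

```python
import numpy as np

def group_lasso(Z, y, groups, lam, weights=None, n_iter=500):
    """Minimize ||y - Z b||^2 + lam * sum_j w_j ||b_j||_2 by proximal gradient."""
    n_groups = int(groups.max()) + 1
    w = np.ones(n_groups) if weights is None else np.asarray(weights, dtype=float)
    # Gradient of the squared loss is -2 Z'(y - Z b); its Lipschitz constant
    # is 2 * (largest singular value of Z)^2.
    step = 1.0 / (2.0 * np.linalg.norm(Z, 2) ** 2)
    b = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        v = b + step * 2.0 * (Z.T @ (y - Z @ b))  # gradient step
        for j in range(n_groups):
            idx = groups == j
            norm_j = np.linalg.norm(v[idx])
            thresh = step * lam * w[j]
            # Block soft-thresholding: shrink the group or set it exactly to zero.
            v[idx] = 0.0 if norm_j <= thresh else (1.0 - thresh / norm_j) * v[idx]
        b = v
    return b

def group_norms(b, groups):
    """Euclidean norm of each coefficient group."""
    return np.array([np.linalg.norm(b[groups == j])
                     for j in range(int(groups.max()) + 1)])

def adaptive_group_lasso(Z, y, groups, lam1, lam2):
    """Group Lasso initial estimator followed by a weighted group Lasso."""
    b_init = group_lasso(Z, y, groups, lam1)
    norms = group_norms(b_init, groups)
    keep = np.flatnonzero(norms > 0)  # groups surviving the first step
    b = np.zeros(Z.shape[1])
    if keep.size == 0:
        return b, b_init
    cols = np.isin(groups, keep)
    relabel = np.searchsorted(keep, groups[cols])  # relabel kept groups 0..K-1
    b[cols] = group_lasso(Z[:, cols], y, relabel, lam2, weights=1.0 / norms[keep])
    return b, b_init
```

In the additive model, the columns of Z belonging to group j would be the centered B-spline basis functions evaluated at the jth covariate, and the penalty parameters would be chosen by BIC or EBIC; both steps are left out of this sketch.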

Our theoretical results are concerned with a fixed sequence of penalty 
parameters and are not applicable to the case where the penalty parameters 
are selected by data-driven procedures such as the BIC. This is an 
important and challenging problem that deserves further investigation, but is 
beyond the scope of this paper. We have only considered linear nonparamet- 
ric additive models. The adaptive group Lasso can be applied to generalized 
nonparametric additive models, such as the generalized logistic nonparamet- 
ric additive model and other nonparametric models with high-dimensional 
data. However, more work is needed to understand the properties of this 
approach in those more complicated models. 

APPENDIX: PROOFS 
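We first prove the following lemmas. Denote the centered versions of $\mathcal{S}_n$ by 

$$\mathcal{S}_{nj}^0 = \Bigl\{ f_{nj} : f_{nj}(x) = \sum_{k=1}^{m_n} b_{jk}\psi_k(x),\ (b_{j1},\ldots,b_{jm_n})' \in \mathbb{R}^{m_n} \Bigr\}, \qquad 1 \le j \le p,$$

where $\psi_k$'s are the centered spline bases defined in (5). 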

Lemma 1. Suppose that $f \in \mathcal{F}$ and $\mathrm{E}f(X_j) = 0$. Then under (A3) and 
(A4), there exists an $f_n \in \mathcal{S}_{nj}^0$ satisfying 

$$\|f_n - f\|_2 = O_p(m_n^{-d} + m_n^{1/2}n^{-1/2}).$$

In particular, if we choose $m_n = O(n^{1/(2d+1)})$, then 

$$\|f_n - f\|_2 = O_p(m_n^{-d}) = O_p(n^{-d/(2d+1)}).$$
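
The second display of Lemma 1 follows because the choice of $m_n$ balances the two terms of the first display. As a short check, taking $m_n = c\,n^{1/(2d+1)}$ for an arbitrary constant $c > 0$ (the constant is introduced here only for illustration):

```latex
\begin{align*}
m_n^{-d} &= c^{-d}\, n^{-d/(2d+1)},\\
m_n^{1/2} n^{-1/2} &= c^{1/2}\, n^{\frac{1}{2(2d+1)} - \frac{1}{2}}
  = c^{1/2}\, n^{-\frac{2d}{2(2d+1)}} = c^{1/2}\, n^{-d/(2d+1)}.
\end{align*}
```

Both terms are therefore of order $n^{-d/(2d+1)}$, which is the rate stated in the lemma.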

Proof. By (A4), for $f \in \mathcal{F}$, there is an $f_n^* \in \mathcal{S}_n$ such that $\|f - f_n^*\|_2 = O(m_n^{-d})$. 
Let $f_n = f_n^* - n^{-1}\sum_{i=1}^n f_n^*(X_{ij})$. Then $f_n \in \mathcal{S}_{nj}^0$ and 
$|f_n - f| \le |f_n^* - f| + |P_n f_n^*|$, where $P_n$ is the empirical measure of i.i.d. random variables 
$X_{1j}, \ldots, X_{nj}$. Consider 

$$P_n f_n^* = (P_n - P)f_n^* + P(f_n^* - f).$$

Here, we use the linear functional notation, for example, $Pf = \int f\,dP$, where 
$P$ is the probability measure of $X_{1j}$. For any $\varepsilon > 0$, the bracketing number 
$N_{[\,]}(\varepsilon, \mathcal{S}_{nj}^0, L_2(P))$ of $\mathcal{S}_{nj}^0$ satisfies $\log N_{[\,]}(\varepsilon, \mathcal{S}_{nj}^0, L_2(P)) \le c_1 m_n \log(1/\varepsilon)$ 
for some constant $c_1 > 0$ [Shen and Wong (1994), page 597]. Thus, by 
the maximal inequality; see, for example, van der Vaart (1998, page 288), 
$(P_n - P)f_n^* = O_p(n^{-1/2}m_n^{1/2})$. By (A4), $|P(f_n^* - f)| \le C_2\|f_n^* - f\|_2 = O(m_n^{-d})$ 
for some constant $C_2 > 0$. The lemma follows from the triangle inequality. 
□ 
Lemma 2. Suppose that conditions (A2) and (A4) hold. Let 

$$T_{jk} = n^{-1/2}m_n^{1/2}\sum_{i=1}^n \psi_k(X_{ij})\varepsilon_i, \qquad 1 \le j \le p,\ 1 \le k \le m_n,$$

and $T_n = \max_{1\le j\le p,\,1\le k\le m_n}|T_{jk}|$. Then 

$$\mathrm{E}(T_n) \le C_1 n^{-1/2}m_n^{1/2}\sqrt{\log(pm_n)}\,\Bigl(\sqrt{2C_2m_n^{-1}n\log(pm_n)} + 4\log(2pm_n) + C_2nm_n^{-1}\Bigr)^{1/2},$$

where $C_1$ and $C_2$ are two positive constants. In particular, when $m_n\log(pm_n)/n \to 0$, 

$$\mathrm{E}(T_n) = O(1)\sqrt{\log(pm_n)}.$$
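
The step from the general bound of Lemma 2 to its $O(1)\sqrt{\log(pm_n)}$ form can be spelled out as follows. This is a sketch of the calculation using the constants $C_1$, $C_2$ of the lemma, and assuming $pm_n \ge 2$ so that $\log(2pm_n) \le 2\log(pm_n)$; under $m_n\log(pm_n)/n \to 0$ the term $C_2nm_n^{-1}$ dominates the bracket:

```latex
\begin{align*}
\sqrt{2C_2 m_n^{-1} n \log(pm_n)}
  &= n m_n^{-1}\sqrt{2C_2\, m_n\log(pm_n)/n} = o(n m_n^{-1}),\\
4\log(2pm_n) &\le 8\log(pm_n)
  = n m_n^{-1}\cdot 8\, m_n\log(pm_n)/n = o(n m_n^{-1}),\\
\mathrm{E}(T_n) &\le C_1 n^{-1/2} m_n^{1/2}\sqrt{\log(pm_n)}\,
  \bigl(n m_n^{-1}(C_2 + o(1))\bigr)^{1/2}
  = C_1 (C_2 + o(1))^{1/2}\sqrt{\log(pm_n)}.
\end{align*}
```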

Proof. Let $s_{njk}^2 = \sum_{i=1}^n \psi_k^2(X_{ij})$. Conditional on $X_{ij}$'s, $T_{jk}$'s are 
sub-Gaussian. Let $s_n^2 = \max_{1\le j\le p,\,1\le k\le m_n} s_{njk}^2$. By (A2) and the maximal 
inequality for sub-Gaussian random variables [van der Vaart and Wellner 
(1996), Lemmas 2.2.1 and 2.2.2], 

$$\mathrm{E}\Bigl(\max_{1\le j\le p,\,1\le k\le m_n}|T_{jk}| \Bigm| \{X_{ij}, 1\le i\le n, 1\le j\le p\}\Bigr) \le C_1n^{-1/2}m_n^{1/2}s_n\sqrt{\log(pm_n)}.$$

Therefore, 

$$\mathrm{E}\Bigl(\max_{1\le j\le p,\,1\le k\le m_n}|T_{jk}|\Bigr) \le C_1n^{-1/2}m_n^{1/2}\sqrt{\log(pm_n)}\,\mathrm{E}(s_n), \tag{9}$$

where $C_1 > 0$ is a constant. By (A4) and the properties of B-splines, 

$$|\psi_k(X_{ij})| \le |\phi_k(X_{ij})| + |\bar\phi_{jk}| \le 2 \quad\text{and}\quad \mathrm{E}(\psi_k(X_{ij}))^2 \le C_2m_n^{-1} \tag{10}$$

for a constant $C_2 > 0$, for every $1 \le j \le p$ and $1 \le k \le m_n$. By (10), 

$$\sum_{i=1}^n \mathrm{E}[\psi_k^2(X_{ij}) - \mathrm{E}\psi_k^2(X_{ij})]^2 \le 4C_2nm_n^{-1} \tag{11}$$

and 

$$\sum_{i=1}^n \mathrm{E}\psi_k^2(X_{ij}) \le C_2nm_n^{-1}. \tag{12}$$

By Lemma A.1 of van de Geer (2008), (10) and (11) imply 

$$\mathrm{E}\Bigl(\max_{1\le j\le p,\,1\le k\le m_n}\Bigl|\sum_{i=1}^n\{\psi_k^2(X_{ij}) - \mathrm{E}\psi_k^2(X_{ij})\}\Bigr|\Bigr) \le \sqrt{2C_2m_n^{-1}n\log(pm_n)} + 4\log(2pm_n).$$

Therefore, by (12) and the triangle inequality, 

$$\mathrm{E}s_n^2 \le \sqrt{2C_2m_n^{-1}n\log(pm_n)} + 4\log(2pm_n) + C_2nm_n^{-1}.$$

Now since $\mathrm{E}s_n \le (\mathrm{E}s_n^2)^{1/2}$, we have 

$$\mathrm{E}s_n \le \Bigl(\sqrt{2C_2m_n^{-1}n\log(pm_n)} + 4\log(2pm_n) + C_2nm_n^{-1}\Bigr)^{1/2}. \tag{13}$$

The lemma follows from (9) and (13). □ 

Denote 

$$\beta_A = (\beta_j', j \in A)' \quad\text{and}\quad Z_A = (Z_j, j \in A).$$

Here, $\beta_A$ is an $|A|m_n \times 1$ vector and $Z_A$ is an $n \times |A|m_n$ matrix. Let $C_A = 
Z_A'Z_A/n$. When $A = \{1, \ldots, p\}$, we simply write $C = Z'Z/n$. Let $\rho_{\min}(C_A)$ 
and $\rho_{\max}(C_A)$ be the minimum and maximum eigenvalues of $C_A$, 
respectively. 

Lemma 3. Let $m_n = O(n^{\gamma})$ where $0 < \gamma < 0.5$. Suppose that $|A|$ is bounded 
by a fixed constant independent of $n$ and $p$. Let $h \equiv h_n \asymp m_n^{-1}$. Then under 
(A3) and (A4), with probability converging to one, 

$$c_1h_n \le \rho_{\min}(C_A) \le \rho_{\max}(C_A) \le c_2h_n,$$

where $c_1$ and $c_2$ are two positive constants. 

Proof. Without loss of generality, suppose $A = \{1, \ldots, q\}$. Then $Z_A = 
(Z_1, \ldots, Z_q)$. Let $b = (b_1', \ldots, b_q')'$, where $b_j \in \mathbb{R}^{m_n}$. By Lemma 3 of Stone 
(1985), 

$$\|Z_1b_1 + \cdots + Z_qb_q\|_2 \ge c_3(\|Z_1b_1\|_2 + \cdots + \|Z_qb_q\|_2)$$

for a certain constant $c_3 > 0$. By the triangle inequality, 

$$\|Z_1b_1 + \cdots + Z_qb_q\|_2 \le \|Z_1b_1\|_2 + \cdots + \|Z_qb_q\|_2.$$

Since $Z_Ab = Z_1b_1 + \cdots + Z_qb_q$, the above two inequalities imply that 

$$c_3(\|Z_1b_1\|_2 + \cdots + \|Z_qb_q\|_2) \le \|Z_Ab\|_2 \le \|Z_1b_1\|_2 + \cdots + \|Z_qb_q\|_2.$$

Therefore, 

$$c_3^2(\|Z_1b_1\|_2^2 + \cdots + \|Z_qb_q\|_2^2) \le \|Z_Ab\|_2^2 \le 2(\|Z_1b_1\|_2^2 + \cdots + \|Z_qb_q\|_2^2). \tag{14}$$

Let $C_j = n^{-1}Z_j'Z_j$. By Lemma 6.2 of Zhou, Shen and Wolfe (1998), 

$$c_4h \le \rho_{\min}(C_j) \le \rho_{\max}(C_j) \le c_5h, \qquad j \in A. \tag{15}$$

Since $C_A = n^{-1}Z_A'Z_A$, it follows from (14) that 

$$c_3^2(b_1'C_1b_1 + \cdots + b_q'C_qb_q) \le b'C_Ab \le 2(b_1'C_1b_1 + \cdots + b_q'C_qb_q).$$

Therefore, by (15), 

$$\frac{b_1'C_1b_1}{\|b\|_2^2} + \cdots + \frac{b_q'C_qb_q}{\|b\|_2^2} = \frac{b_1'C_1b_1}{\|b_1\|_2^2}\frac{\|b_1\|_2^2}{\|b\|_2^2} + \cdots + \frac{b_q'C_qb_q}{\|b_q\|_2^2}\frac{\|b_q\|_2^2}{\|b\|_2^2} \ge \rho_{\min}(C_1)\frac{\|b_1\|_2^2}{\|b\|_2^2} + \cdots + \rho_{\min}(C_q)\frac{\|b_q\|_2^2}{\|b\|_2^2} \ge c_4h.$$

Similarly, 

$$\frac{b_1'C_1b_1}{\|b\|_2^2} + \cdots + \frac{b_q'C_qb_q}{\|b\|_2^2} \le c_5h.$$

Thus, we have 

$$c_3^2c_4h \le \frac{b'C_Ab}{b'b} \le 2c_5h.$$

The lemma follows. □ 



Proof of Theorem 1. The proof of parts (i) and (ii) essentially follows 
the proof of Theorem 2.1 of Wei and Huang (2008). The only change that 
must be made here is that we need to consider the approximation error of 
the regression functions by splines. Specifically, let $\xi_n = \varepsilon_n + \delta_n$, where $\delta_n = 
(\delta_{n1}, \ldots, \delta_{nn})'$ with $\delta_{ni} = \sum_{j=1}^q (f_{0j}(X_{ij}) - f_{nj}(X_{ij}))$. Since $\|f_{0j} - f_{nj}\|_2 = 
O(m_n^{-d}) = O(n^{-d/(2d+1)})$ for $m_n = n^{1/(2d+1)}$, we have 

$$\|\delta_n\|_2 \le C_1\sqrt{nq^2m_n^{-2d}} = C_1qn^{1/(2(2d+1))}$$

for some constant $C_1 > 0$. For any integer $t$, let 

$$\chi_t = \max_{|A|=t}\ \max_{\|U_{A_k}\|_2=1,\,1\le k\le t} \frac{|\xi_n'V_A(s)|}{\|V_A(s)\|_2} \quad\text{and}\quad \chi_t^* = \max_{|A|=t}\ \max_{\|U_{A_k}\|_2=1,\,1\le k\le t} \frac{|\varepsilon_n'V_A(s)|}{\|V_A(s)\|_2},$$

where V A (S A ) = g n {Z A (Z> A Z A )- l S A -(I- P A )X(3 for N(A) = q 1 = m>0, 
S A = (S' Al ,...,S' Am )', S Ak =\ y fd A ~U Ak and 11*7^112 = 1. 
For a sufficiently large constant $C_2 > 0$, define 

$$\Omega_{t_0} = \{(Z, \varepsilon_n) : \chi_t \le \sigma C_2\sqrt{(t \vee 1)m_n\log(pm_n)},\ \forall t \ge t_0\}$$

and 

$$\Omega_{t_0}^* = \{(Z, \varepsilon_n) : \chi_t^* \le \sigma C_2\sqrt{(t \vee 1)m_n\log(pm_n)},\ \forall t \ge t_0\},$$

where $t_0 \ge 0$. 

As in the proof of Theorem 2.1 of Wei and Huang (2008), 

$$(Z, \varepsilon_n) \in \Omega_q \ \Rightarrow\ |A_1| \le M_1q$$

for a constant $M_1 > 1$. By the triangle and Cauchy-Schwarz inequalities, 

$$\frac{|\xi_n'V_A(s)|}{\|V_A(s)\|_2} = \frac{|\varepsilon_n'V_A(s) + \delta_n'V_A(s)|}{\|V_A(s)\|_2} \le \frac{|\varepsilon_n'V_A(s)|}{\|V_A(s)\|_2} + \|\delta_n\|_2. \tag{16}$$

In the proof of Theorem 2.1 of Wei and Huang (2008), it is shown that 

( 17 ) P ^)>2-^-exp(-^)^l. 
Since 

$$\frac{|\delta_n'V_A(s)|}{\|V_A(s)\|_2} \le \|\delta_n\|_2 \le C_1qn^{1/(2(2d+1))}$$

and $m_n = O(n^{1/(2d+1)})$, we have for all $t \ge 0$ and $n$ sufficiently large, 

$$\|\delta_n\|_2 \le C_1qn^{1/(2(2d+1))} \le \sigma C_2\sqrt{(t \vee 1)m_n\log(p)}. \tag{18}$$

It follows from (16), (17) and (18) that $P(\Omega_0) \to 1$. This completes the proof 
of part (i) of Theorem 1. 

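Before proving part (ii), we first prove part (iii) of Theorem 1. By the 
definition of $\tilde\beta_n \equiv (\tilde\beta_{n1}', \ldots, \tilde\beta_{np}')'$, 

$$\|Y - Z\tilde\beta_n\|_2^2 + \lambda_{n1}\sum_{j=1}^p\|\tilde\beta_{nj}\|_2 \le \|Y - Z\beta_n\|_2^2 + \lambda_{n1}\sum_{j=1}^p\|\beta_{nj}\|_2. \tag{19}$$

Let $A_2 = \{j : \|\beta_{nj}\|_2 \ne 0 \text{ or } \|\tilde\beta_{nj}\|_2 \ne 0\}$ and $d_{n2} = |A_2|$. By part (i), $d_{n2} = 
O_p(q)$. By (19) and the definition of $A_2$, 

$$\|Y - Z_{A_2}\tilde\beta_{nA_2}\|_2^2 + \lambda_{n1}\sum_{j\in A_2}\|\tilde\beta_{nj}\|_2 \le \|Y - Z_{A_2}\beta_{nA_2}\|_2^2 + \lambda_{n1}\sum_{j\in A_2}\|\beta_{nj}\|_2. \tag{20}$$

Let $\eta_n = Y - Z\beta_n$. Write 

$$Y - Z_{A_2}\tilde\beta_{nA_2} = Y - Z\beta_n - Z_{A_2}(\tilde\beta_{nA_2} - \beta_{nA_2}) = \eta_n - Z_{A_2}(\tilde\beta_{nA_2} - \beta_{nA_2}).$$

We have 

$$\|Y - Z_{A_2}\tilde\beta_{nA_2}\|_2^2 = \|Z_{A_2}(\tilde\beta_{nA_2} - \beta_{nA_2})\|_2^2 - 2\eta_n'Z_{A_2}(\tilde\beta_{nA_2} - \beta_{nA_2}) + \eta_n'\eta_n.$$

We can rewrite (20) as 

$$\|Z_{A_2}(\tilde\beta_{nA_2} - \beta_{nA_2})\|_2^2 - 2\eta_n'Z_{A_2}(\tilde\beta_{nA_2} - \beta_{nA_2}) \le \lambda_{n1}\sum_{j\in A_1}\|\beta_{nj}\|_2 - \lambda_{n1}\sum_{j\in A_1}\|\tilde\beta_{nj}\|_2. \tag{21}$$

Now 

$$\Bigl|\sum_{j\in A_1}\|\beta_{nj}\|_2 - \sum_{j\in A_1}\|\tilde\beta_{nj}\|_2\Bigr| \le \sqrt{|A_1|}\cdot\|\tilde\beta_{nA_1} - \beta_{nA_1}\|_2 \le \sqrt{|A_1|}\cdot\|\tilde\beta_{nA_2} - \beta_{nA_2}\|_2. \tag{22}$$

Let $\nu_n = Z_{A_2}(\tilde\beta_{nA_2} - \beta_{nA_2})$. Combining (20), (21) and (22), we get 

$$\|\nu_n\|_2^2 - 2\eta_n'\nu_n \le \lambda_{n1}\sqrt{|A_1|}\cdot\|\tilde\beta_{nA_2} - \beta_{nA_2}\|_2. \tag{23}$$

Let $\eta_n^*$ be the projection of $\eta_n$ to the span of $Z_{A_2}$, that is, $\eta_n^* = Z_{A_2}(Z_{A_2}'Z_{A_2})^{-1}Z_{A_2}'\eta_n$. 
By the Cauchy-Schwarz inequality, 

$$2|\eta_n'\nu_n| \le 2\|\eta_n^*\|_2\cdot\|\nu_n\|_2 \le 2\|\eta_n^*\|_2^2 + \tfrac{1}{2}\|\nu_n\|_2^2. \tag{24}$$

From (23) and (24), we have 

$$\|\nu_n\|_2^2 \le 4\|\eta_n^*\|_2^2 + 2\lambda_{n1}\sqrt{|A_1|}\cdot\|\tilde\beta_{nA_2} - \beta_{nA_2}\|_2.$$

Let $c_{n*}$ be the smallest eigenvalue of $Z_{A_2}'Z_{A_2}/n$. By Lemma 3 and part (i), 
$c_{n*} \asymp_p m_n^{-1}$. Since $\|\nu_n\|_2^2 \ge nc_{n*}\|\tilde\beta_{nA_2} - \beta_{nA_2}\|_2^2$ and $2ab \le a^2 + b^2$, 

$$nc_{n*}\|\tilde\beta_{nA_2} - \beta_{nA_2}\|_2^2 \le 4\|\eta_n^*\|_2^2 + \frac{(2\lambda_{n1}\sqrt{|A_1|})^2}{2nc_{n*}} + \frac{1}{2}nc_{n*}\|\tilde\beta_{nA_2} - \beta_{nA_2}\|_2^2.$$

It follows that 

$$\|\tilde\beta_{nA_2} - \beta_{nA_2}\|_2^2 \le \frac{8\|\eta_n^*\|_2^2}{nc_{n*}} + \frac{4\lambda_{n1}^2|A_1|}{n^2c_{n*}^2}. \tag{25}$$

Let $f_0(X_i) = \sum_{j=1}^p f_{0j}(X_{ij})$ and $f_{0A}(X_i) = \sum_{j\in A} f_{0j}(X_{ij})$. Write 

$$\eta_i = Y_i - \mu - f_0(X_i) + (\mu - \bar Y) + f_0(X_i) - \sum_{j\in A_2} f_{nj}(X_{ij}) = \varepsilon_i + (\mu - \bar Y) + f_{0A_2}(X_i) - f_{nA_2}(X_i).$$

Since $|\mu - \bar Y|^2 = O_p(n^{-1})$ and $\|f_{0j} - f_{nj}\|_\infty = O(m_n^{-d})$, we have 

$$\|\eta_n^*\|_2^2 \le 2\|\varepsilon_n^*\|_2^2 + O_p(1) + O(nd_{n2}m_n^{-2d}), \tag{26}$$

where $\varepsilon_n^*$ is the projection of $\varepsilon_n = (\varepsilon_1, \ldots, \varepsilon_n)'$ to the span of $Z_{A_2}$. We have 

$$\|\varepsilon_n^*\|_2^2 = \|(Z_{A_2}'Z_{A_2})^{-1/2}Z_{A_2}'\varepsilon_n\|_2^2 \le \frac{1}{nc_{n*}}\|Z_{A_2}'\varepsilon_n\|_2^2.$$

Now 

$$\max_{A:|A|\le d_{n2}}\|Z_A'\varepsilon_n\|_2^2 = \max_{A:|A|\le d_{n2}}\sum_{j\in A}\|Z_j'\varepsilon_n\|_2^2 \le d_{n2}m_n\max_{1\le j\le p,\,1\le k\le m_n}|Z_{jk}'\varepsilon_n|^2,$$

where $Z_{jk} = (\psi_k(X_{1j}), \ldots, \psi_k(X_{nj}))'$. By Lemma 2, 

$$\max_{1\le j\le p,\,1\le k\le m_n}|Z_{jk}'\varepsilon_n|^2 = nm_n^{-1}\max_{1\le j\le p,\,1\le k\le m_n}|(m_n/n)^{1/2}Z_{jk}'\varepsilon_n|^2 = O_p(1)nm_n^{-1}\log(pm_n).$$

It follows that 

$$\|\varepsilon_n^*\|_2^2 = O_p(1)\frac{d_{n2}\log(pm_n)}{c_{n*}}. \tag{27}$$

Combining (25), (26) and (27), we get 

$$\|\tilde\beta_{nA_2} - \beta_{nA_2}\|_2^2 \le O_p\Bigl(\frac{d_{n2}\log(pm_n)}{nc_{n*}^2}\Bigr) + O_p\Bigl(\frac{1}{nc_{n*}}\Bigr) + O\Bigl(\frac{d_{n2}m_n^{-2d}}{c_{n*}}\Bigr) + \frac{4\lambda_{n1}^2|A_1|}{n^2c_{n*}^2}.$$

Since $d_{n2} = O_p(q)$, $c_{n*} \asymp_p m_n^{-1}$ and $c_n^* \asymp_p m_n^{-1}$, we have 

$$\|\tilde\beta_{nA_2} - \beta_{nA_2}\|_2^2 \le O_p\Bigl(\frac{m_n^2\log(pm_n)}{n}\Bigr) + O_p\Bigl(\frac{m_n}{n}\Bigr) + O\Bigl(\frac{1}{m_n^{2d-1}}\Bigr) + O\Bigl(\frac{4m_n^2\lambda_{n1}^2}{n^2}\Bigr).$$

This completes the proof of part (iii). 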

We now prove part (ii). Since $\|f_j\|_2 \ge c_f > 0$, $1 \le j \le q$, $\|f_j - f_{nj}\|_2 = 
O(m_n^{-d})$ and $\|f_{nj}\|_2 \ge \|f_j\|_2 - \|f_j - f_{nj}\|_2$, we have $\|f_{nj}\|_2 \ge 0.5c_f$ for $n$ 
sufficiently large. By a result of de Boor (2001), see also (12) of Stone (1986), 
there are positive constants $c_6$ and $c_7$ such that 

$$c_6m_n^{-1}\|\beta_{nj}\|_2^2 \le \|f_{nj}\|_2^2 \le c_7m_n^{-1}\|\beta_{nj}\|_2^2.$$

It follows that $\|\beta_{nj}\|_2^2 \ge c_7^{-1}m_n\|f_{nj}\|_2^2 \ge 0.25c_7^{-1}c_f^2m_n$. Therefore, if $\|\beta_{nj}\|_2 \ne 0$ 
but $\|\tilde\beta_{nj}\|_2 = 0$, then 

$$\|\tilde\beta_{nj} - \beta_{nj}\|_2^2 \ge 0.25c_7^{-1}c_f^2m_n. \tag{28}$$

However, since $(m_n\log(pm_n))/n \to 0$ and $(\lambda_{n1}^2m_n)/n^2 \to 0$, (28) contradicts 
part (iii). □ 
Proof of Theorem 2. By the definition of $\tilde f_{nj}$, $1 \le j \le p$, parts (i) and 
(ii) follow from parts (i) and (ii) of Theorem 1 directly. 

Now consider part (iii). By the properties of spline [de Boor (2001)], 

$$c_6m_n^{-1}\|\tilde\beta_{nj} - \beta_{nj}\|_2^2 \le \|\tilde f_{nj} - f_{nj}\|_2^2 \le c_7m_n^{-1}\|\tilde\beta_{nj} - \beta_{nj}\|_2^2.$$

Thus, 

$$\|\tilde f_{nj} - f_{nj}\|_2^2 = O_p\Bigl(\frac{m_n\log(pm_n)}{n}\Bigr) + O_p\Bigl(\frac{1}{n}\Bigr) + O\Bigl(\frac{1}{m_n^{2d}}\Bigr) + O\Bigl(\frac{4m_n\lambda_{n1}^2}{n^2}\Bigr). \tag{29}$$

By (A3), 

$$\|f_j - f_{nj}\|_2^2 = O(m_n^{-2d}). \tag{30}$$

Part (iii) follows from (29) and (30). □ 

In the proofs below, for any matrix H, denote its 2-norm by ||H||, which 
is equal to its largest eigenvalue. This norm satisfies the inequality ||Hx|| < 
||H||||x|| for a column vector x whose dimension is the same as the number 
of the columns of H. 

Denote $\beta_{nA_1} = (\beta_{nj}', j \in A_1)'$, $\hat\beta_{nA_1} = (\hat\beta_{nj}', j \in A_1)'$ and $Z_{A_1} = (Z_j, j \in 
A_1)$. Define $C_{A_1} = n^{-1}Z_{A_1}'Z_{A_1}$. Let $\rho_{n1}$ and $\rho_{n2}$ be the smallest and largest 
eigenvalues of $C_{A_1}$, respectively. 

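Proof of Theorem 3. By the KKT, a necessary and sufficient 
condition for $\hat\beta_n$ is 

$$\begin{cases} 2Z_j'(Y - Z\hat\beta_n) = \lambda_{n2}w_{nj}\dfrac{\hat\beta_{nj}}{\|\hat\beta_{nj}\|_2}, & \|\hat\beta_{nj}\|_2 \ne 0,\ j \ge 1,\\[2mm] 2\|Z_j'(Y - Z\hat\beta_n)\|_2 \le \lambda_{n2}w_{nj}, & \|\hat\beta_{nj}\|_2 = 0,\ j \ge 1. \end{cases} \tag{31}$$

Let $\nu_n = (w_{nj}\hat\beta_{nj}/(2\|\hat\beta_{nj}\|_2), j \in A_1)'$. Define 

$$\hat\beta_{nA_1} = (Z_{A_1}'Z_{A_1})^{-1}(Z_{A_1}'Y - \lambda_{n2}\nu_n). \tag{32}$$

If $\hat\beta_{nA_1} =_0 \beta_{nA_1}$, then the equation in (31) holds for $\hat\beta_n \equiv (\hat\beta_{nA_1}', 0')'$. 
Thus, since $Z\hat\beta_n = Z_{A_1}\hat\beta_{nA_1}$ for this $\hat\beta_n$ and $\{Z_j, j \in A_1\}$ are linearly 
independent, 

$$\hat\beta_n =_0 \beta_n \quad\text{if}\quad \begin{cases} \hat\beta_{nA_1} =_0 \beta_{nA_1},\\ \|Z_j'(Y - Z_{A_1}\hat\beta_{nA_1})\|_2 \le \lambda_{n2}w_{nj}/2, & \forall j \notin A_1. \end{cases}$$

This is true if 

$$\hat\beta_n =_0 \beta_n \quad\text{if}\quad \begin{cases} \|\beta_{nj}\|_2 - \|\hat\beta_{nj}\|_2 < \|\beta_{nj}\|_2, & \forall j \in A_1,\\ \|Z_j'(Y - Z_{A_1}\hat\beta_{nA_1})\|_2 \le \lambda_{n2}w_{nj}/2, & \forall j \notin A_1. \end{cases}$$

Therefore, 

$$P(\hat\beta_n \ne_0 \beta_n) \le P(\|\hat\beta_{nj} - \beta_{nj}\|_2 \ge \|\beta_{nj}\|_2,\ \exists j \in A_1) + P(\|Z_j'(Y - Z_{A_1}\hat\beta_{nA_1})\|_2 > \lambda_{n2}w_{nj}/2,\ \exists j \notin A_1).$$

Let $f_{0j}(X_j) = (f_{0j}(X_{1j}), \ldots, f_{0j}(X_{nj}))'$ and $\delta_n = \sum_{j\in A_1} f_{0j}(X_j) - Z_{A_1}\beta_{nA_1}$. 
By Lemma 1, we have 

$$n^{-1}\|\delta_n\|^2 = O_p(qm_n^{-2d}). \tag{33}$$

Let $H_n = I_n - Z_{A_1}(Z_{A_1}'Z_{A_1})^{-1}Z_{A_1}'$. By (32), 

$$\hat\beta_{nA_1} - \beta_{nA_1} = n^{-1}C_{A_1}^{-1}(Z_{A_1}'(\varepsilon_n + \delta_n) - \lambda_{n2}\nu_n) \tag{34}$$

and 

$$Y - Z_{A_1}\hat\beta_{nA_1} = H_n\varepsilon_n + H_n\delta_n + \lambda_{n2}Z_{A_1}C_{A_1}^{-1}\nu_n/n. \tag{35}$$

Based on these two equations, Lemma 5 below shows that 

$$P(\|\hat\beta_{nj} - \beta_{nj}\|_2 \ge \|\beta_{nj}\|_2,\ \exists j \in A_1) \to 0,$$

and Lemma 6 below shows that 

$$P(\|Z_j'(Y - Z_{A_1}\hat\beta_{nA_1})\|_2 > \lambda_{n2}w_{nj}/2,\ \exists j \notin A_1) \to 0.$$

These two equations lead to part (i) of the theorem. 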

We now prove part (ii) of Theorem 3. As in (26), for r\ n = Y — Z/3 n and 

Vni = z ^i( Z Ai Z A!) - Z' Al r7 n , 

we have 

(36) ||r ? ; i ||2<2||e: i i + O p (l)+0(( 7 nm- M ), 

where e*j is the projection of e n = (e±, . . . , e n )' to the span of Z Al . We have 

( 37 ) IK1II2 = ll(ZA 1 ZA 1 )~ 1/2 Z / Al e n || 2 < -^—\\Z' Ai £ n \\l = O p {l)^-. 

n Pnl Pnl 

Now similarly to the proof of (25), we can show that 

( oo\ ||3 a ||2 ^ 8IK1II2 1 4A »2l^ll 

l^oj PnAi"PnAi 2i 1 o~2 ■ 

npni n Pn\ 
Combining (36), (37) and (38), we get 

l/U -<Wi = *(4) +°,(^) +0 (^=t) +°(|f ) 

Since p n i x^m^ 1 , the result follows. □ 

The following lemmas are needed in the proof of Theorem 3. 

Lemma 4. For v n = (w n j0j/(2\\0 n j\\),j & A\)' , under condition (Bl), 

KIP = O p (h 2 n ) = p {(b 2 n ic b )- 2 r- 1 + qb~l). 
Proof. Write 

kii 2 = E »?=E n^ir 2 = E j^w€ + E HA.J- 1 - 

j'eAi jeAi jeAi WnjW WPnjW jgAi 

Under (B2), 

V lll/3nil|2 ~^ rajl121 < Mc- b 2 b- n t\\P n - /3J| 

and X^jeAi IIAv/ll" 2 — Q^nl ■ The c l a i m follows. □ 

Let p n 3 be the maximum of the largest eigenvalues of n^Z'jZj, j € ^Lj) 
that is, /3 n 3 = maxj gy 4 \\n ZjZj ||2- By Lemma 3, 

(39) 6„ixO(my 2 ), p n i Xpm" 1 , Pra-p™' 1 and p^^m" 1 . 



Lemma 5. Under conditions (Bl), (B2), (A3) and (A4), 

(40) P(||3ni -P nj h > WPnjh^j e A) ->o. 

Proof. Let T nj - be an m n x gm n matrix with the form 

Try = {0 m „ -i ■ ■ ■ j m?l , I mn , mn , . . . , m?l ) , 

where O m „ is an m n x m n matrix of zeros and l mn is an m n x m n identity 
matrix, and I mn is at the jth block. By (34), (3 n j — (3 n j = n~ 1 T n jC^(Z' Ai s n + 
^Ai^« ~~ \i2 u n)- By the triangle inequality, 

WPnj — Pnjh < n ~ 1 \\ r ^njC A 1 i Z' Ai e n \\2 + n _1 \\T n jC A ^Z' Al S n || 2 

(41) 

n i^Ai l/n ll 2, 
Let C be a generic constant independent of n. The first term on the 
right-hand side 

maxn _1 ||T nj -C^Z^ 1 e r i||2 < rT 1 p~l\\Z' M e n \\2 

(42) =n^ 2 P ~l\\n^Z' Ai e n \\ 2 

= O p (l)n-V2 p -i m -i/2 (gmn) V2. 

By (33), the second term 

m.axn~ 1 \\T n jC^ i Z' Ai 8 n \\ 2 < ||C^|| 2 • \\n~ 1 Z Al Z Al \\ 1 2 /2 ■ ||ra _1 (5 n ||2 

(43) 

= O p {l)p-lp 1 £q l l 2 m- d . 

By Lemma 4, the third term 

(44) maxn _1 A n2 ||T n: ,C7V n ||2 < nX n2 p~l\\v n \\ 2 = Op^p'^n' 1 X n2 h n . 
Thus, (40) follows from (39), (42)-(44) and condition (B2a). □ 

Lemma 6. Under conditions (Bl), (B2), (A3) and (A4), 

(45) P(||Z;.(Y - Z Al 3 nAl )|| 2 > X n2 w nj /2, 3j <£ A x ) 0. 

PROOF. By (35), we have 

(46) Z^Y - Z A J nAl ) = Z^H re £ n + Z'jHnSn + ArT^-Z^C^n- 

Recall s n = p — q is the number of zero components in the model. By Lemma 
2, 

(47) E(max\\n- 1 ^Z' j H n e n \\ 2 ) < 0(l){log(s n mn)} 1/2 - 

Since = ||/3 n j|| _1 = O p (r n ) for j ^ Ai and by (47), for the first term on 
the right-hand side of (46), we have 

P(||Z;.H n e n || 2 > X n2 w nj /6,3j £ A x ) 

< P(||Z' H n e n || 2 > CX n2 r n , 3j £ A x ) + o(l) 

(48) 

= pfmaxllrr^Z'H n e n \\ 2 > Cn-^ 2 X n2 r n ) + o(l) 

<o(i) nl/2{1 °g (g " m - )}1/2 +0 (i). 

CA n2 r n 
By (33), the second term on the right-hand side of (46) 

maxllZ'-EL^yjIU < re 1 / 2 max||ra~ 1 Z'-Z,-||9 • ||HL|| 2 ■ ll^nlb 

(49) 

By Lemma 4, the third term on the right-hand side of (46) 
max A„ 2 n~ 1 1 1 Zj Z Al C7 1 v n || 2 

(50) < Anamaxlln-^Zj-lla ■ Hn-^Z^C , 1/2 || 2 • ||C Ai 1/2 || 2 • |K|| 2 

Therefore, (45) follows from (39), (48), (49), (50) and condition (B2b). □ 

Proof of Theorem 4. The proof is similar to that of Theorem 2 and 
is omitted. □ 